Abstract

The article is a lightly edited version of my habilitation thesis at the
University of Würzburg. My aim is to give a self-contained, if concise, introduction
to the formal methods used when off-line learning in feedforward
networks is analyzed by statistical physics. However, due to its origin,
the article is not a comprehensive review of the field but is highly skewed
towards reporting my own research.
Chapter 1

Capacity of the perceptron



Choosing a weight vector $J \in \mathbb{R}^N$ defines a dichotomy of the $P$ inputs $\xi^\mu \in \mathbb{R}^N$ by
classifying an input as $\mathrm{sgn}(J^T \xi^\mu)$. We now calculate the number $C(P,N)$ of dichotomies
which can be obtained in this manner by an inductive argument due to Schläfli.¹ Let
$\xi$ be an additional input and assume that all points are in general position. Let $D$ be
the number of dichotomies on $\xi^1, \ldots, \xi^P$ which can be represented by a weight vector
$J$ satisfying $J^T \xi = 0$. For any such dichotomy we obtain two dichotomies differing
only on $\xi$ by replacing $J$ with $J' = J \pm \epsilon \xi$ and choosing $\epsilon$ sufficiently small. Hence
$C(P+1,N) = C(P,N) + D$. Further $J^T \xi = 0$, the constraint defining $D$, means that
$J$ is confined to an $N-1$ dimensional subspace, so $D = C(P,N-1)$. Finally, the
recursion

$$C(P+1,N) = C(P,N) + C(P,N-1)$$

with boundary conditions $C(1,N) = C(P,1) = 2$ has the solution

$$C(P,N) = 2 \sum_{i=0}^{N-1} \binom{P-1}{i} . \tag{1.1}$$

In particular the perceptron can implement all possible dichotomies, $C(P,N) = 2^P$,
only if $P \le N$. But considering large $N$ and scaling $P$ as $P = \alpha N$ one finds

$$\lim_{N\to\infty} 2^{-P} C(P,N) = \begin{cases} 1 & \text{if } \alpha < 2 \\ 0 & \text{if } \alpha > 2 . \end{cases}$$

So in the limit of large $N$ almost all possible dichotomies can be implemented as
long as $P/N < 2$.

¹ To simplify the argument, we only count the dichotomies which can be obtained with a $J$ satisfying
$J^T \xi^\mu \neq 0$ for all $\mu$.

The above result was rederived by Elisabeth Gardner using the very different approach
of statistical physics. While Gardner's calculation is more involved, it can be
adapted to many related scenarios and in particular is a starting point for analyzing
learning in multilayer networks.
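
The counting result of Eq. (1.1) is easy to check numerically. The following is a minimal sketch using only the Python standard library; the function names `count_dichotomies` and `fraction_implementable` are illustrative choices and do not appear in the text. It verifies the recursion for small sizes and prints the fraction of implementable dichotomies on both sides of $\alpha = 2$.

```python
from math import comb


def count_dichotomies(P: int, N: int) -> int:
    """Closed form of Eq. (1.1): C(P, N) = 2 * sum_{i=0}^{N-1} binom(P-1, i)."""
    return 2 * sum(comb(P - 1, i) for i in range(N))


def fraction_implementable(P: int, N: int) -> float:
    """Fraction 2^{-P} C(P, N) of all 2^P labelings a perceptron can realize."""
    return count_dichotomies(P, N) / 2 ** P


# The closed form obeys the recursion C(P+1, N) = C(P, N) + C(P, N-1).
for P in range(1, 12):
    for N in range(2, 12):
        assert count_dichotomies(P + 1, N) == (
            count_dichotomies(P, N) + count_dichotomies(P, N - 1)
        )

# Fraction of implementable dichotomies below, at and above alpha = P/N = 2.
N = 200
for alpha in (1.5, 2.0, 2.5):
    print(alpha, fraction_implementable(int(alpha * N), N))
```

Already for $N = 200$ the fraction is close to 1 at $\alpha = 1.5$ and close to 0 at $\alpha = 2.5$, while it equals 1/2 at $\alpha = 2$.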

For a given dichotomy, let $\tau^\mu \in \{-1,1\}$ be the labels of the inputs $\xi^\mu$; we shall
call the input/output pairs $\mathbb{D} = \{(\xi^\mu, \tau^\mu)\}_{\mu=1}^P$ the training set. A perceptron with
weight vector $J$ implements the dichotomy if $\tau^\mu J^T \xi^\mu > 0$ for all $\mu$, and for convenience
we may assume that the Euclidean norm of $J$ equals 1. In terms of the Heaviside step
function

$$\Theta(x) = \begin{cases} 0 & \text{if } x \le 0 \\ 1 & \text{if } x > 0 \end{cases}$$

the volume $V(\mathbb{D})$ of the weight vectors implementing the dichotomy can then be written
as

$$V(\mathbb{D}) = \int dJ \prod_{\mu=1}^P \Theta(\tau^\mu J^T \xi^\mu) . \tag{1.2}$$

The integration is over the unit sphere in $\mathbb{R}^N$ and we normalize the measure such that
$V(\emptyset) = 1$.

In statistical mechanics one is interested in the properties of $V(\mathbb{D})$ given a distribution
of training sets $\mathbb{D}$. We shall always assume that the patterns $(\xi^\mu, \tau^\mu)$ in a training
set are obtained by independently sampling a random variable $(\xi, \tau)$ with values in
$\mathbb{R}^N \times \{-1,1\}$. For the storage capacity problem considered by Gardner $\tau$ is further
assumed independent of $\xi$, and $\tau = \pm 1$ with equal probability. When averaging over
all training sets $\mathbb{D}$ of size $P$ one then obtains
$\langle V(\mathbb{D}) \rangle_{\mathbb{D}} = \langle V(\{(\xi,\tau)\}) \rangle_{(\xi,\tau)}^P = 2^{-P}$.



Despite its simplicity this result is remarkable when compared to Eq. (1.1). For
large $N$ and $P > 2N$ Eq. (1.1) means that for almost all training sets $V(\mathbb{D}) = 0$. This
is not at all reflected in the behavior of $\langle V(\mathbb{D}) \rangle_{\mathbb{D}}$; so there must be a few training sets
for which $V(\mathbb{D})$ is very large compared to $2^{-P}$. Instead of averaging $V(\mathbb{D})$, one thus
has to consider quantities such as $\langle \Theta(V(\mathbb{D})) \rangle_{\mathbb{D}}$, the probability that a dichotomy can
be implemented, or $\langle \ln V(\mathbb{D}) \rangle_{\mathbb{D}}$, which will diverge if the probability that $V(\mathbb{D}) = 0$
is finite. Calculating these averages analytically is, however, quite difficult, but they
could easily be obtained if one knew $\langle V^n(\mathbb{D}) \rangle_{\mathbb{D}}$ for all real $n$.

The basic idea of the replica method is to calculate the moments of $V(\mathbb{D})$, that is
to consider $\langle V^n(\mathbb{D}) \rangle_{\mathbb{D}}$ for the special case that $n$ is a natural number. In contrast to
general $n \in \mathbb{R}$, this case is tractable and it turns out that the expression $g(n)$ for the
$n$-th moment thus found can be evaluated for real $n$ and is even an analytic function
of $n$. So assuming this analytical continuation to be correct, i.e. $\langle V^n(\mathbb{D}) \rangle_{\mathbb{D}} = g(n)$ for
all positive $n$, one then for instance obtains the probability that a dichotomy can be
implemented as $\langle \Theta(V(\mathbb{D})) \rangle_{\mathbb{D}} = \lim_{n \to +0} g(n)$.
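
The identity $\langle \Theta(V) \rangle = \lim_{n \to +0} \langle V^n \rangle$ used here holds for any nonnegative random variable, which a toy example makes concrete. The sketch below is only an illustration: the discrete distribution `toy` and the helper `moment` are hypothetical and unrelated to the actual volume distribution. As $n$ decreases towards zero, the moment approaches the probability 0.7 that the variable is strictly positive.

```python
def moment(dist, n):
    """n-th moment <V^n> of a discrete variable given as (value, probability) pairs."""
    return sum(p * v ** n for v, p in dist)


# Hypothetical toy distribution: V = 0 with probability 0.3, otherwise positive.
toy = [(0.0, 0.3), (1e-3, 0.2), (0.5, 0.5)]

# For n -> +0 the moment tends to P(V > 0) = 0.7, since 0^n = 0 for every n > 0.
for n in (1.0, 0.1, 0.01, 1e-6):
    print(n, moment(toy, n))
```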

The replica method is not just applicable when the random variable in question is
a volume and we shall straight away consider the more general form

$$Z(\mathbb{D}) = \int dJ \prod_{\mu=1}^P F(\tau^\mu J^T \xi^\mu) , \tag{1.3}$$

so $Z(\mathbb{D}) = V(\mathbb{D})$ for the special case that $F$ is the $\Theta$-function. We assume that $F$ is
nonnegative and that the RHS of Eq. (1.3) as well as some related integrals are well
defined. The name replica method is motivated by the fact that $Z^n(\mathbb{D})$ is an $n$-fold
integral for integer $n$:

$$Z^n(\mathbb{D}) = \int d\mathbf{J} \prod_{\mu=1}^P \prod_{a=1}^n F(\tau^\mu J^{aT} \xi^\mu) ,$$

where $d\mathbf{J} = \prod_{a=1}^n dJ^a$. For the moments of $Z(\mathbb{D})$ one then has

$$\langle Z^n(\mathbb{D}) \rangle_{\mathbb{D}} = \int d\mathbf{J} \left\langle \prod_{a=1}^n F(\tau J^{aT} \xi) \right\rangle_{(\xi,\tau)}^P \tag{1.4}$$



since the examples are independent. To evaluate the average one has to make some
assumptions about the distribution of the inputs and it is simplest to assume that
the components $\xi_j$ of $\xi$ are i.i.d. $\mathcal{N}(0,1)$, that is independent and Gaussian with
zero mean and unit variance. Then the distribution of the inner products $J^{aT} \xi$ in
Eq. (1.4) is Gaussian as well, with zero mean and covariances
$\langle J^{aT}\xi \, J^{bT}\xi \rangle_\xi = J^{aT} J^b$.
Consequently if $X(Q)$ is an $n$-dimensional Gaussian of zero mean and with a covariance
matrix $Q$ satisfying $Q_{ab} = J^{aT} J^b$ one has

$$\left\langle \prod_{a=1}^n F(\tau J^{aT} \xi) \right\rangle_{(\xi,\tau)} = \left\langle \prod_{a=1}^n F(\tau X_a(Q)) \right\rangle_{X(Q),\tau} = \left\langle \prod_{a=1}^n F(X_a(Q)) \right\rangle_{X(Q)} , \tag{1.5}$$

where the last equality holds because $X(Q)$ and $-X(Q)$ have the same distribution.
Since the integrand in Eq. (1.4) depends on the weight vectors $J^a$ only via their
overlaps $J^{aT} J^b$, it is convenient to transform the integration variables. This is best
done by multiplying Eq. (1.4) with

$$\int dQ \, \delta(Q - \mathbf{J}^T \mathbf{J}) \equiv \int dQ \prod_{a<b\le n} \delta(Q_{ab} - J^{aT} J^b) = 1 \tag{1.6}$$

and changing the order of integration. The integral over $Q$ runs over the symmetric
and positive definite matrices with $Q_{aa} = 1$. Combining Eqs. (1.4), (1.5) and (1.6) then yields

$$\langle Z^n(\mathbb{D}) \rangle_{\mathbb{D}} = \int dQ \, D_n(Q) \left\langle \prod_{a=1}^n F(X_a(Q)) \right\rangle_{X(Q)}^P$$
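where $D_n(Q) = \int d\mathbf{J} \, \delta(Q - \mathbf{J}^T \mathbf{J})$. In the appendix a simple derivation is given that
$D_n(Q) = D_n(1) (\det Q)^{(N-n-1)/2}$ where 1 is the $n$ by $n$ identity matrix. Thus, setting
$P = \alpha N$,

$$\langle Z^n(\mathbb{D}) \rangle_{\mathbb{D}} = D_n(1) \int dQ \left( (\det Q)^{\frac{N-n-1}{2N}} \left\langle \prod_{a=1}^n F(X_a(Q)) \right\rangle_{X(Q)}^{\alpha} \right)^{N} .$$

Now the integration is over $n(n-1)/2$ dimensions and this number does not increase
with $N$. We may thus use that the $L_N$-norm converges to the maximum norm with
increasing $N$ to find

$$\lim_{N\to\infty} N^{-1} \ln \langle Z^n(\mathbb{D}) \rangle_{\mathbb{D}} = \max_Q \left[ \frac{1}{2} \ln \det Q + \alpha \ln \left\langle \prod_{a=1}^n F(X_a(Q)) \right\rangle_{X(Q)} \right] . \tag{1.7}$$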



Solving this extremal problem for general $Q$ is quite difficult and one thus restricts
the search to a small subspace of all possible matrices $Q$. If one assumes that the
extremal problem has a unique solution $Q^*$, all off diagonal elements of $Q^*$ must have
the same value since the set of solutions is invariant under permutations of the replica
indices. This is known as the replica symmetric assumption.

One is thus led to consider $n$ by $n$ matrices $M_n(u,v)$ with diagonal elements equal
to $u$ and off diagonal elements equal to $v$. A simple calculation shows that $(1,1,\ldots,1)^T$
is an eigenvector of $M_n(u,v)$ with eigenvalue $u + (n-1)v$ and that the matrix further
has $n-1$ linearly independent eigenvectors of the form $(1,0,\ldots,0,-1,0,\ldots,0)^T$ with
eigenvalue $u - v$. Thus $\det M_n(u,v) = (u + (n-1)v)(u-v)^{n-1}$. Assuming replica
symmetry, $Q^* = M_n(1,q)$, and we can simplify the first term in Eq. (1.7).
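
The determinant formula for $M_n(u,v)$ can be confirmed in exact arithmetic. The sketch below is an illustration only; the helper names `rs_matrix`, `det` and `rs_det` are assumptions introduced here. It compares a determinant computed by Gaussian elimination over rational numbers with the closed form for several $n$.

```python
from fractions import Fraction


def rs_matrix(n, u, v):
    """n-by-n matrix M_n(u, v): u on the diagonal, v everywhere else."""
    return [[Fraction(u) if a == b else Fraction(v) for b in range(n)]
            for a in range(n)]


def det(matrix):
    """Determinant by Gaussian elimination in exact rational arithmetic."""
    m = [row[:] for row in matrix]
    n = len(m)
    result = Fraction(1)
    for col in range(n):
        # Find a row with a nonzero entry in this column to use as pivot.
        pivot = next((r for r in range(col, n) if m[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            result = -result  # a row swap flips the sign
        result *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return result


def rs_det(n, u, v):
    """Closed form det M_n(u, v) = (u + (n-1) v) (u - v)^(n-1)."""
    u, v = Fraction(u), Fraction(v)
    return (u + (n - 1) * v) * (u - v) ** (n - 1)


# Replica symmetric matrices Q* = M_n(1, q) for an arbitrary test value of q.
q = Fraction(1, 3)
for n in range(1, 7):
    assert det(rs_matrix(n, 1, q)) == rs_det(n, 1, q)
```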

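Further if the $n+1$ random variables $z_0, \ldots, z_n$ are i.i.d. $\mathcal{N}(0,1)$, the covariance matrix
of the linear combinations $X_a = \sqrt{u-v}\, z_a + \sqrt{v}\, z_0$, $a = 1, \ldots, n$, is just $M_n(u,v)$.
For $Q^* = M_n(1,q)$ this observation enables us to factorize the average in Eq. (1.7) as

$$\left\langle \prod_{a=1}^n F(X_a(Q^*)) \right\rangle_{X(Q^*)} = \left\langle \left\langle F(\sqrt{1-q}\, z_1 + \sqrt{q}\, z_0) \right\rangle_{z_1}^{n} \right\rangle_{z_0} .$$

So in replica symmetry we obtain the important intermediate result

$$\lim_{N\to\infty} N^{-1} \ln \langle Z^n(\mathbb{D}) \rangle_{\mathbb{D}} = \max_q f(n,q) , \tag{1.8}$$

where

$$f(n,q) = \frac{1}{2} \ln(1 + (n-1)q) + \frac{n-1}{2} \ln(1-q) + \alpha \ln \left\langle \left\langle F(\sqrt{1-q}\, z_1 + \sqrt{q}\, z_0) \right\rangle_{z_1}^{n} \right\rangle_{z_0} . \tag{1.9}$$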
Now $f(n,q)$ is well defined for any nonnegative value of $n$ and not just for integer $n$.
We can thus consider the analytical continuation to values of $n$ close to zero. Here,
one has to be rather careful since $f(n,q)$ does not depend on $q$ when $n = 1$, and in
particular $f(1,q) = \alpha \ln \langle F(z_0) \rangle_{z_0}$. Expanding $f(n,q)$ around $n = 1$ thus yields

$$f(n,q) \approx \alpha \ln \langle F(z_0) \rangle_{z_0} + (n-1) f_1(1,q) ,$$

where $f_1$ is the partial derivative of $f$ w.r.t. the first argument. So if $q^*$ maximizes
$f_1(1,q)$ and if $n$ is greater than but close to 1, $f(n,q^*)$ will be a good approximation to the
maximum of $f(n,q)$. But by the same argument $f(n,q^*)$ will be a good approximation
to the minimum of $f(n,q)$ when $n$ is close to but smaller than 1. When looking for
a function $q(n)$ such that $f(n,q(n))$ is analytic, one will at least want $q(n)$ to be
continuous at $n = 1$. Hence $f(n,q)$ must be minimized when $n < 1$. So for small
positive values of $n$ we obtain

$$\lim_{N\to\infty} N^{-1} \ln \langle Z^n(\mathbb{D}) \rangle_{\mathbb{D}} = O(n^2) + n \min_q f_1(0,q) , \tag{1.10}$$

and from Eq. (1.9)

$$f_1(0,q) = \frac{1}{2} \left( \frac{q}{1-q} + \ln(1-q) \right) + \alpha \left\langle \ln \left\langle F(\sqrt{1-q}\, z_1 + \sqrt{q}\, z_0) \right\rangle_{z_1} \right\rangle_{z_0} .$$



The first term in the above sum is often called the entropy term since it is determined by
the constraints on the weight vectors of the perceptron, in the present case continuous
and normalized weight vectors. The second term, which depends on the choice of $F$,
is called the energy term.

We now specialize to Gardner's case where $Z(\mathbb{D})$ is the volume $V(\mathbb{D})$ and $F(x) = \Theta(x)$.
Introducing the function $H(x) = \langle \Theta(z_1 - x) \rangle_{z_1}$, which is closely related to the
error function, the expression for $f_1$ simplifies to

$$f_1(0,q) = \frac{1}{2} \left( \frac{q}{1-q} + \ln(1-q) \right) + \alpha \left\langle \ln H\!\left( -z_0 \sqrt{q} / \sqrt{1-q} \right) \right\rangle_{z_0} . \tag{1.11}$$

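Since the probability that a dichotomy can be implemented by the perceptron is
$\langle \Theta(V(\mathbb{D})) \rangle_{\mathbb{D}} = \lim_{n \to +0} \langle V^n(\mathbb{D}) \rangle_{\mathbb{D}}$, by Eq. (1.10) this probability approaches 1 in
the limit of large $N$ if $\min_q f_1(0,q) > -\infty$; otherwise it is 0. Similarly for $\langle \ln V(\mathbb{D}) \rangle_{\mathbb{D}}$,
using that this is the derivative w.r.t. $n$ of $\ln \langle V^n(\mathbb{D}) \rangle_{\mathbb{D}}$ at $n = 0$, one has

$$\lim_{N\to\infty} N^{-1} \langle \ln V(\mathbb{D}) \rangle_{\mathbb{D}} = \min_q f_1(0,q) .$$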

To find the critical value $\alpha_c$ where the minimum of $f_1(0,q)$ diverges to $-\infty$, note that
$f_1(0,q)$ only diverges for $q = 1$. The average $\langle \ln H(-z_0 \sqrt{q}/\sqrt{1-q}) \rangle_{z_0}$ can be calculated
analytically for $q \to 1$ using that for large positive arguments
$H(x) \sim e^{-x^2/2} / (\sqrt{2\pi}\, x)$,
whereas $H(-\infty) = 1$. This yields that for $q \to 1$

$$\left\langle \ln H\!\left( -z_0 \sqrt{q}/\sqrt{1-q} \right) \right\rangle_{z_0} \sim -\frac{1}{4} \frac{q}{1-q}$$

and the final result that $\min_q f_1(0,q)$ is only finite if $\alpha < \alpha_c = 2$.
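
The divergence at $\alpha_c = 2$ can be seen by evaluating Eq. (1.11) numerically. The following is a rough sketch, not part of the original derivation: the helper names `log_H` and `f1`, the integration range, the step count and the grid of $q$ values are all ad hoc choices, and the $z_0$-average is approximated by a simple Riemann sum. For $\alpha = 1.5$ the minimum over the grid lies at an interior value of $q$, whereas for $\alpha = 2.5$ the function keeps decreasing as $q$ approaches 1.

```python
import math


def log_H(x):
    """ln H(x), H(x) = P(z > x) for a standard normal z; stable for large x."""
    if x < 30.0:
        return math.log(0.5 * math.erfc(x / math.sqrt(2.0)))
    # Leading large-x behavior H(x) ~ exp(-x^2 / 2) / (sqrt(2 pi) x).
    return -0.5 * x * x - math.log(x) - 0.5 * math.log(2.0 * math.pi)


def f1(q, alpha, z_max=8.0, steps=800):
    """Replica symmetric f_1(0, q) of Eq. (1.11); z_0-average by a Riemann sum."""
    t = math.sqrt(q / (1.0 - q))
    dz = 2.0 * z_max / steps
    energy = 0.0
    for i in range(steps + 1):
        z = -z_max + i * dz
        weight = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        energy += weight * log_H(-z * t) * dz
    entropy = 0.5 * (q / (1.0 - q) + math.log(1.0 - q))
    return entropy + alpha * energy


# Grid of overlaps 0 <= q < 1; the last point probes the approach to q = 1.
qs = [i / 200 for i in range(200)] + [0.999]
for alpha in (1.5, 2.5):
    values = [f1(q, alpha) for q in qs]
    k = min(range(len(qs)), key=values.__getitem__)
    print(alpha, qs[k], values[k])
```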

In the limit of large $N$, almost all dichotomies can be implemented by the perceptron
up to $\alpha_c$, but the fraction of implementable dichotomies is vanishingly small when
$\alpha > \alpha_c$. In this sense Gardner's calculation exactly coincides with the result obtained
by Schläfli. Further the fact that $\langle \Theta(V(\mathbb{D})) \rangle_{\mathbb{D}}$ converges to a step function implies that
$\Theta(V(\mathbb{D}))$ is selfaveraging: for any $\epsilon > 0$ the probability that
$|\Theta(V(\mathbb{D})) - \langle \Theta(V(\mathbb{D})) \rangle_{\mathbb{D}}| > \epsilon$
vanishes in the large $N$ limit (except at $\alpha_c = 2$). A simple scaling argument shows that
$N^{-1} \ln V(\mathbb{D})$ should be selfaveraging as well. The variance of $\ln V(\mathbb{D})$ is equal to the
second derivative w.r.t. $n$ of $\ln \langle V^n(\mathbb{D}) \rangle_{\mathbb{D}}$ evaluated at $n = 0$. When using replicas to
find an analytical continuation to small $n$, one must obtain that the second derivative
of $N^{-1} \ln \langle V^n(\mathbb{D}) \rangle_{\mathbb{D}}$ is finite. This yields that the variance of $N^{-1} \ln V(\mathbb{D})$ is $O(1/N)$.

To round off Gardner's calculation we consider the physical interpretation of the
parameter $q$, relating it to the mean of all weight vectors implementing a given dichotomy:

$$\bar{J}(\mathbb{D}) = V^{-1}(\mathbb{D}) \int dJ \, J \prod_{\mu=1}^{\alpha N} \Theta(\tau^\mu J^T \xi^\mu) .$$

One can show that $\langle \|\bar{J}(\mathbb{D})\|^2 \rangle_{\mathbb{D}} \to q^*$ with increasing $N$, where $q^*$ is the value
minimizing $f_1(0,q)$. Further, the averaged squared length of $\bar{J}(\mathbb{D})$ can be written as

$$\langle \|\bar{J}(\mathbb{D})\|^2 \rangle_{\mathbb{D}} = \left\langle V^{-2}(\mathbb{D}) \int dJ^1 dJ^2 \, J^{1T} J^2 \prod_{\mu=1}^P \prod_{a=1}^2 \Theta(\tau^\mu J^{aT} \xi^\mu) \right\rangle_{\mathbb{D}} .$$

It can thus also be regarded as the average overlap of two perceptron weight vectors
$J^1$ and $J^2$, picked at random from the set of all perceptrons which implement a given
dichotomy. This physical interpretation of $q$ will be derived in Section 5.3, in the
context of discussing the relationship between the parameterization of the matrix $Q$
and the distribution of the overlaps $J^{1T} J^2$. As a consequence we shall also find that
$\|\bar{J}(\mathbb{D})\|^2$ is selfaveraging if the replica symmetric parameterization is correct.




Chapter 2 

Extensions and Ramifications 



2.1 Beyond capacity 

If the number of patterns in the training set $\mathbb{D}$ is too large, no perceptron will exist which
implements the dichotomy perfectly. One may, however, still try to find a network $\sigma$
which makes few mistakes on $\mathbb{D}$, so $\sigma$ should have a small training error

$$\epsilon_{\mathbb{D}}(\sigma) = \frac{1}{P} \sum_{\mu=1}^P \Theta(-\tau^\mu \sigma(\xi^\mu)) . \tag{2.1}$$

To adapt Gardner's calculation one first defines a probability density on the class of
all networks by

$$p(\sigma) = \frac{e^{-\beta P \epsilon_{\mathbb{D}}(\sigma)}}{Z(\mathbb{D})} , \tag{2.2}$$

where the partition function $Z(\mathbb{D})$ assures that the density is normalized. The parameter
$\beta$ is called the inverse temperature, and a network drawn from $p(\sigma)$ will have
minimal training error in the limit of large $\beta$. Note that one can calculate the average,
w.r.t. $p(\sigma)$, of the training error from the derivative of $\ln Z(\mathbb{D})$ with respect to $\beta$.

In the context of a dynamical interpretation, $p(\sigma)$ is the stationary distribution of
a suitable Langevin dynamics and $-\beta^{-1} \ln Z(\mathbb{D})$ plays the role of a free energy. But I
shall not be concerned with such an interpretation here.

Choosing $F$ as

$$F(x) = e^{-\beta \Theta(-x)}$$

in the case of the perceptron yields

$$Z(\mathbb{D}) = \int dJ \, e^{-\beta P \epsilon_{\mathbb{D}}(\sigma_J)} = \int dJ \prod_{\mu=1}^P F(\tau^\mu J^T \xi^\mu) . \tag{2.3}$$
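
The remark that the mean training error follows from the derivative of $\ln Z(\mathbb{D})$ with respect to $\beta$ can be illustrated on a finite toy ensemble. The sketch below is an assumption-laden illustration: the list `counts` of error counts $P \epsilon_{\mathbb{D}}(\sigma)$ for six networks is invented, the integral over networks is replaced by a sum, and the helper names are hypothetical. It compares $-\partial \ln Z / \partial \beta$, estimated by a central difference, with the Gibbs average of the number of errors.

```python
import math


def log_partition(error_counts, beta):
    """ln Z with Z = sum over networks of exp(-beta * P * eps)."""
    return math.log(sum(math.exp(-beta * e) for e in error_counts))


def mean_errors(error_counts, beta):
    """Average number of training errors P * <eps> under the Gibbs density."""
    weights = [math.exp(-beta * e) for e in error_counts]
    return sum(w * e for w, e in zip(weights, error_counts)) / sum(weights)


# Hypothetical ensemble: misclassified patterns for each of six networks.
counts = [0, 1, 1, 2, 3, 5]
beta, h = 0.7, 1e-5

# Central difference estimate of d ln Z / d beta.
derivative = (log_partition(counts, beta + h)
              - log_partition(counts, beta - h)) / (2 * h)
print(-derivative, mean_errors(counts, beta))
```

For large $\beta$ the Gibbs average concentrates on the networks with the fewest errors, in line with the statement about the limit of large inverse temperature.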
The second expression is the same as in the definition of $Z(\mathbb{D})$ used in Gardner's calculation,
Eq. (1.3). So, in replica symmetry, we have already calculated $N^{-1} \langle \ln Z(\mathbb{D}) \rangle_{\mathbb{D}}$.
However, major complications arise from the fact that the assumption of replica symmetry
is wrong for $\alpha > \alpha_c$. This is extensively reviewed in [10]. I shall discuss the
techniques for dealing with broken replica symmetry in the context of multilayer networks.

Considering the case of finite $\beta$ helps to deal with a technical problem in the above
exposition of Gardner's calculation. From Schläfli's result we know that for finite $N$
and $\alpha > 1$ there is a finite probability that $V(\mathbb{D}) = 0$. So $\langle \ln V(\mathbb{D}) \rangle_{\mathbb{D}}$ diverges, and
the expression we found for the large $N$ limit of $N^{-1} \langle \ln V(\mathbb{D}) \rangle_{\mathbb{D}}$ is surely incorrect in
the range $1 < \alpha < 2$. This is actually quite pleasing since our result reflects the fact
that the probability that $V(\mathbb{D}) = 0$ vanishes in the large $N$ limit for $\alpha < 2$. To make
sense of the calculations, however, one should use Eq. (2.3) at a finite value of $\beta$ and
when considering $N^{-1} \langle \ln Z(\mathbb{D}) \rangle_{\mathbb{D}}$ first take the limit of large $N$ and then the limit
$\beta \to \infty$. In the replica calculation the two limits commute since for finite $N$ we are
only calculating the moments of $Z(\mathbb{D})$.



2.2 Discrete weight vectors 

Up to now we have assumed that the components of the weight vector can take on any
real value. In numerical calculations, however, the set of possible values will be finite,
even if it can be quite large. We thus assume that the vector $J$ is restricted to lie
in a finite subset $\mathbb{L}$ of $\mathbb{R}^N$ and consider the number $M_{\mathbb{L}}(\mathbb{D})$ of networks from $\mathbb{L}$ that
implement a given dichotomy:

$$M_{\mathbb{L}}(\mathbb{D}) = \sum_{J \in \mathbb{L}} \prod_{\mu=1}^P \Theta(\tau^\mu J^T \xi^\mu) .$$

As for continuous weights it is again instructive and simple to calculate the average,
$\langle M_{\mathbb{L}}(\mathbb{D}) \rangle_{\mathbb{D}} = 2^{-P} \, \mathrm{card}\, \mathbb{L}$. For $P = \alpha N$ the average will become zero for large $N$ unless
the number of networks increases at least exponentially with $N$. We thus assume that
there are $L$ possible values for each weight and so $\mathrm{card}\, \mathbb{L} = L^N$. Then for large $N$ one
finds $\langle M_{\mathbb{L}}(\mathbb{D}) \rangle_{\mathbb{D}} \to 0$ when $\alpha > \log_2 L$. Since the possible values of $M_{\mathbb{L}}(\mathbb{D})$ are discrete,
for such an $\alpha$ the probability that a dichotomy can be implemented by a network in
$\mathbb{L}$ becomes zero. This is sometimes called the information theoretic bound, since any
weight vector in $\mathbb{L}$ can be represented using $N \log_2 L$ bits.

It is interesting that in contrast to continuous weights a simple average of $M_{\mathbb{L}}(\mathbb{D})$
can yield some insight into the critical capacity. But already for $L > 4$ a tighter
bound for the perceptron is $\alpha_c \le 2$, treating the discrete case as a restriction of the
continuous one. To improve on the information theoretic bound, let $\phi$ be an orthogonal
transformation of $\mathbb{R}^N$ and $\phi\mathbb{L}$ the set obtained by applying $\phi$ to the elements of $\mathbb{L}$. Since
the distribution of inputs is isotropic, for any function $f$:

$$\langle f(M_{\mathbb{L}}(\mathbb{D})) \rangle_{\mathbb{D}} = \langle f(M_{\phi\mathbb{L}}(\mathbb{D})) \rangle_{\mathbb{D}} .$$

Denote by $\langle \ldots \rangle_\phi$ the average over the uniform density on the orthogonal group of $\mathbb{R}^N$,
then by Jensen's inequality for any concave function $f$:

$$\langle f(M_{\mathbb{L}}(\mathbb{D})) \rangle_{\mathbb{D}} = \left\langle \langle f(M_{\phi\mathbb{L}}(\mathbb{D})) \rangle_{\mathbb{D}} \right\rangle_\phi \le \left\langle f\!\left( \langle M_{\phi\mathbb{L}}(\mathbb{D}) \rangle_\phi \right) \right\rangle_{\mathbb{D}} .$$

Now $\langle M_{\phi\mathbb{L}}(\mathbb{D}) \rangle_\phi = V(\mathbb{D}) \, \mathrm{card}\, \mathbb{L}$ and we obtain a simple bound in terms of the spherical
volume. In particular if $\mathbb{L}$ has $L^N$ elements we have
$\langle M_{\mathbb{L}}^n(\mathbb{D}) \rangle_{\mathbb{D}} \le L^{Nn} \langle V^n(\mathbb{D}) \rangle_{\mathbb{D}}$ for
$0 < n \le 1$. Since $M_{\mathbb{L}}(\mathbb{D})$ is integer, $\Theta(M_{\mathbb{L}}(\mathbb{D})) \le M_{\mathbb{L}}^n(\mathbb{D})$, and we obtain an upper
bound on the probability that a dichotomy can be implemented by one of the $L^N$
vectors

$$\langle \Theta(M_{\mathbb{L}}(\mathbb{D})) \rangle_{\mathbb{D}} \le L^{Nn} \langle V^n(\mathbb{D}) \rangle_{\mathbb{D}} .$$

For $n = 1$ this is just the information theoretic bound but tighter bounds can be
obtained by evaluating the RHS for $n < 1$. Using the results for the continuous case,
one can easily compute the smallest $\alpha_c(L)$ with the property that for any $\alpha > \alpha_c(L)$
the right hand side decays to zero exponentially with increasing $N$ for some finite value
$n(\alpha)$. This yields an upper bound on the critical capacity. For $L = 2$ one obtains
$\alpha_c(2) = 0.85$ and this bound is close to the value $\alpha_c = 0.83$ found for $\mathbb{L} = \{-1,1\}^N$ by
calculating $N^{-1} \langle \ln M_{\mathbb{L}}(\mathbb{D}) \rangle_{\mathbb{D}}$ using replicas [13]. The latter value is in good agreement
with results from numerical simulations [12]. Further, based on the findings in [?]
for equidistant weight values, one will expect the bound $\alpha_c(L)$ to be asymptotically
tight for large $L$.

From a conceptual point of view the case of discrete weights is nice because an
assumption implicit in the interpretation of Gardner's calculation can be avoided. In
identifying the critical capacity with the divergence of $N^{-1} \langle \ln Z(\mathbb{D}) \rangle_{\mathbb{D}}$ as well as in
commuting the limit $n \to 0$ with $N \to \infty$ in the calculation of $\langle \Theta(V(\mathbb{D})) \rangle_{\mathbb{D}}$, one
assumes that for an implementable dichotomy $\ln V(\mathbb{D})$ typically is on the order of $N$.
In the discrete case this assumption is not needed. The reason for this is of course that
the information theoretic bound guarantees that below the capacity limit $\ln M_{\mathbb{L}}(\mathbb{D})$ is
on the order of $N$.

2.3 More general input distributions 

Up to now we have assumed that the components of $\xi$ are independent and Gaussian.
But in Gardner's calculation the essential point is not that $\xi$ is Gaussian but that the
field $J^T \xi$ is Gaussian. Using the central limit theorem, one can argue that this will
also be the case when the input components are not Gaussian but just independent. If
further the components have zero mean and unit variance, Gardner's calculation does
not even have to be modified.
It is, however, worthwhile noting that the central limit theorem will not apply for all
choices of $J$. While the set of exceptions will have zero measure for large $N$, so does the
version space, the set of weight vectors implementing a given dichotomy. Reasonably,
one will not expect this to be a problem; but it would be difficult to actually show that
all the important contributions to $\langle \ln Z(\mathbb{D}) \rangle_{\mathbb{D}}$ do come from the region of state space
where the central limit theorem holds.

If one is prepared to live with this, one can argue that the assumption of independent
input components is too strong.¹ We consider the characteristic function of $J^T \xi$

$$c_J(k) = \left\langle e^{i k J^T \xi} \right\rangle_\xi$$

and for simplicity assume that $J$ is drawn from an isotropic Gaussian distribution with
the normalization $\langle \|J\|^2 \rangle = 1$. Then, for the average value of $c_J(k)$ one immediately
finds

$$\langle c_J(k) \rangle_J = \left\langle e^{-\frac{1}{2} k^2 \|\xi\|^2 / N} \right\rangle_\xi .$$

We now assume that $\|\xi\|^2$ is selfaveraging with mean $N$ and not too malicious, so that
$\langle c_J(k) \rangle_J \to e^{-\frac{1}{2} k^2}$, i.e. a Gaussian. Generically $c_J(k)$ will thus also be Gaussian if $c_J(k)$
is selfaveraging. For its second moment we obtain

$$\langle c_J(k)^2 \rangle_J = \left\langle e^{-\frac{1}{2} k^2 \|\xi^1 + \xi^2\|^2 / N} \right\rangle_{\xi^1,\xi^2} = \left\langle e^{-\frac{1}{2} k^2 (\|\xi^1\|^2 + \|\xi^2\|^2)/N} \, e^{-k^2 \xi^{1T} \xi^2 / N} \right\rangle_{\xi^1,\xi^2} ,$$

where $\xi^1$ and $\xi^2$ are independent and have the same distribution as $\xi$. For a large class
of distributions, $\xi^{1T} \xi^2$ is sufficiently small compared to $N$ so that for large $N$:

$$\langle c_J(k)^2 \rangle_J \to \left\langle e^{-\frac{1}{2} k^2 (\|\xi^1\|^2 + \|\xi^2\|^2)/N} \right\rangle_{\xi^1,\xi^2} = \langle c_J(k) \rangle_J^2 .$$

In this case, the variance of $c_J(k)$ vanishes for large $N$, $c_J(k)$ is selfaveraging, and
typically $J^T \xi$ becomes Gaussian.

Even if one will thus expect that it is not really necessary, in the sequel I shall
nevertheless assume i.i.d. $\mathcal{N}(0,1)$ input components for brevity and simplicity.

¹ The following argument is due to Manfred Opper, personal communication.



Chapter 3 

Learning a rule 



In the capacity problem one assumes that the outputs in the training set are independent
of the inputs. For pattern recognition, however, one is mainly interested in
the performance of the network on inputs which were not used for training. This only
makes sense if one assumes that the desired output is not random but depends on
the input, and that this dependency is learned by the network based on the training
examples. So while the training data still consists of $P$ independent samples of the
random variable $(\xi, \tau)$, one no longer assumes that $\tau$ is independent of $\xi$. One scenario
is that the desired output $\tau$ is a binary function $b(\xi)$ of the input $\xi$, and $b$ is then
sometimes called the teacher. One can then measure how well a student, i.e. a network
$\sigma$, approximates the input/output relationship by defining the generalization error

$$\epsilon_g(\sigma) = \langle \Theta(-\tau \sigma(\xi)) \rangle_{(\xi,\tau)} , \tag{3.1}$$

which is just the probability that $(\xi, \sigma(\xi)) \neq (\xi, \tau)$. Training then amounts to finding
a student which makes few mistakes on the examples $\mathbb{D}$, and this is measured by the
training error

$$\epsilon_{\mathbb{D}}(\sigma) = \frac{1}{P} \sum_{\mu=1}^P \Theta(-\tau^\mu \sigma(\xi^\mu)) .$$

A key question in formal learning theory is to which extent minimizing $\epsilon_{\mathbb{D}}$ is conducive
to the actual goal of minimizing $\epsilon_g$.

In the case that the network is a perceptron, a simple model is that $b$ can be
implemented by a perceptron, i.e.

$$b(\xi) = \mathrm{sgn}(B^T \xi)$$

with a suitable weight vector $B \in \mathbb{R}^N$. If the input components are i.i.d. $\mathcal{N}(0,1)$ as in
the capacity problem, it is simple to calculate the generalization error of a perceptron $\sigma_J$:

$$\epsilon_g(\sigma_J) = \left\langle \Theta(-B^T \xi \, J^T \xi) \right\rangle_\xi = \frac{1}{\pi} \arccos B^T J . \tag{3.2}$$

Here, and in the sequel, the weight vectors are normalized to 1.
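
Eq. (3.2) can be checked by direct sampling. In the sketch below, which is an illustration and not part of the original text, only the two fields $B^T \xi$ and $J^T \xi$ are simulated, as a pair of unit-variance Gaussians with correlation $R = B^T J$; the function name `generalization_error_mc`, the sample size and the seed are arbitrary choices.

```python
import math
import random


def generalization_error_mc(R, samples=100_000, seed=0):
    """Monte Carlo estimate of P(sgn(B^T xi) != sgn(J^T xi)) for B^T J = R."""
    rng = random.Random(seed)
    c = math.sqrt(1.0 - R * R)
    errors = 0
    for _ in range(samples):
        x = rng.gauss(0.0, 1.0)              # teacher field B^T xi
        y = R * x + c * rng.gauss(0.0, 1.0)  # student field J^T xi
        if x * y < 0.0:
            errors += 1
    return errors / samples


# Compare the sampled error with arccos(R) / pi for a few overlaps.
for R in (0.0, 0.5, 0.9):
    print(R, generalization_error_mc(R), math.acos(R) / math.pi)
```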

Since the teacher is a perceptron, in contrast to the capacity problem, $V(\mathbb{D})$ is
nonzero for all training set sizes $P$. One can, however, ask whether a student $J$ can
generalize badly but still achieve zero training error. In view of Eq. (3.2) generalizing
badly means that $B^T J$ is small and one thus considers the restricted volumes

$$V_R(\mathbb{D}) = \int dJ \, \delta(R - B^T J) \prod_{\mu=1}^P \Theta(\tau^\mu J^T \xi^\mu) .$$

When $\alpha = P/N$ is sufficiently large, one will expect that $\Theta(V_R(\mathbb{D}))$ vanishes unless
$R$ exceeds a critical value $R_c$. So a student with zero training error must have a
generalization error smaller than $\frac{1}{\pi} \arccos R_c$ and this is sometimes called the worst case
generalization behavior. One may also consider the expected generalization behavior
by computing the value $R^*$ which maximizes $\ln V_R(\mathbb{D})$. This yields the most probable
generalization error of a student picked at random among all perceptrons with zero
training error. Since, as in the capacity problem, $N^{-1} \ln V_R(\mathbb{D})$ is on the order of 1
unless it diverges, for large $N$ the volume corresponding to $R^*$ is much larger than that
of any other value of $R$, and the most probable generalization error is in fact observed
with probability 1 in this limit.

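By the same arguments as in the capacity problem $\Theta(V_R(\mathbb{D}))$ and $N^{-1} \ln V_R(\mathbb{D})$
are selfaveraging, and the generalization behavior is obtained by a straightforward
adaptation of Gardner's calculation. We again consider the more general form

$$Z_R(\mathbb{D}) = \int dJ \, \delta(R - B^T J) \prod_{\mu=1}^P F(\tau^\mu J^T \xi^\mu) \tag{3.3}$$

and obtain for its moments

$$\langle Z_R^n(\mathbb{D}) \rangle_{\mathbb{D}} = \int d\mathbf{J} \left\langle \prod_{a=1}^n F(B^T \xi \, J^{aT} \xi) \right\rangle_\xi^P \prod_{a=1}^n \delta(R - B^T J^a) . \tag{3.4}$$

The $n+1$ random variables $X_{n+1} = B^T \xi$ and $X_a = J^{aT} \xi$ are Gaussian with a covariance
matrix

$$\hat{Q} = \begin{pmatrix} Q & \mathbf{R} \\ \mathbf{R}^T & 1 \end{pmatrix} ;$$

here $Q$ is the $n$ by $n$ matrix $Q_{ab} = J^{aT} J^b$ and $\mathbf{R} = (R, R, \ldots, R)^T$ is an $n$-dimensional
vector. So

$$\langle Z_R^n(\mathbb{D}) \rangle_{\mathbb{D}} = \int dQ \, D_n(Q,R) \left\langle \prod_{a=1}^n F(X_{n+1}(\hat{Q}) X_a(\hat{Q})) \right\rangle_{X(\hat{Q})}^P$$

where $D_n(Q,R) = \int d\mathbf{J} \, \delta(Q - \mathbf{J}^T \mathbf{J}) \prod_{a=1}^n \delta(R - B^T J^a)$.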

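To calculate $D_n(Q,R)$, note that by rotational symmetry it is invariant to the choice
of $B$ as long as $\|B\| = 1$. So averaging over the uniform density on the unit sphere
yields

$$D_n(Q,R) = \int dB \, D_n(Q,R) = D_{n+1}(\hat{Q}) = D_{n+1}(1) (\det \hat{Q})^{(N-n-2)/2} , \tag{3.5}$$

with $\hat{Q}$ the $(n+1)$ by $(n+1)$ covariance matrix of the fields, built from $Q$, the vector
$\mathbf{R} = (R, \ldots, R)^T$ and a 1 in the corner.
The expression for $\det \hat{Q}$ can be simplified since for a square block matrix
$\begin{pmatrix} a & b \\ c & d \end{pmatrix}$ with invertible square matrices $a$ and $d$

$$\det \begin{pmatrix} a & b \\ c & d \end{pmatrix} = \det(a - b d^{-1} c) \det d \tag{3.6}$$

holds.¹ Thus $\det \hat{Q} = \det(Q - \mathbf{R} \mathbf{R}^T)$.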
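
For a replica symmetric $Q = M_n(1,q)$ the reduction of the bordered determinant can be verified exactly. The sketch below is an illustration under that assumption; the helper names `det`, `bordered_matrix` and `schur_det` and the test values of $q$ and $R$ are choices made here. It builds the $(n+1)$ by $(n+1)$ matrix explicitly and compares its determinant with $\det M_n(1 - R^2, q - R^2) = (1 - R^2 + (n-1)(q - R^2))(1-q)^{n-1}$.

```python
from fractions import Fraction


def det(matrix):
    """Determinant by Gaussian elimination in exact rational arithmetic."""
    m = [row[:] for row in matrix]
    n = len(m)
    result = Fraction(1)
    for col in range(n):
        pivot = next((r for r in range(col, n) if m[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            result = -result  # a row swap flips the sign
        result *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return result


def bordered_matrix(n, q, R):
    """Blocks: Q = M_n(1, q), border column (R, ..., R)^T, corner entry 1."""
    rows = [[Fraction(1) if a == b else Fraction(q) for b in range(n)]
            + [Fraction(R)] for a in range(n)]
    rows.append([Fraction(R)] * n + [Fraction(1)])
    return rows


def schur_det(n, q, R):
    """det M_n(1 - R^2, q - R^2) in closed form."""
    q, R = Fraction(q), Fraction(R)
    return (1 - R * R + (n - 1) * (q - R * R)) * (1 - q) ** (n - 1)


q, R = Fraction(1, 2), Fraction(1, 3)
for n in range(1, 6):
    assert det(bordered_matrix(n, q, R)) == schur_det(n, q, R)
```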



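To evaluate the moments for large $N$ and $P = \alpha N$, we again assume that the value
of $Q$ maximizing the integrand is replica symmetric, i.e. $Q = M_n(1,q)$. So it is
straightforward to evaluate $D_n(Q,R)$ since for the full $(n+1)$ by $(n+1)$ covariance matrix
$\hat{Q}$ one has $\det \hat{Q} = \det M_n(1 - R^2, q - R^2)$. Further,
at the maximum $X(\hat{Q})$ can be rewritten in terms of i.i.d. $\mathcal{N}(0,1)$ random variables
$z_{-1}, z_0, \ldots, z_n$ as

$$X_a(\hat{Q}) = R z_{-1} + \sqrt{q - R^2}\, z_0 + \sqrt{1-q}\, z_a \quad \text{and} \quad X_{n+1}(\hat{Q}) = z_{-1} .$$

Then at the maximum the average in Eq. (3.4) simplifies to

$$\left\langle \prod_{a=1}^n F(X_{n+1}(\hat{Q}) X_a(\hat{Q})) \right\rangle_{X(\hat{Q})} = \left\langle \left\langle F\!\left( z_{-1} (R z_{-1} + \sqrt{q - R^2}\, z_0 + \sqrt{1-q}\, z_1) \right) \right\rangle_{z_1}^{n} \right\rangle_{z_{-1}, z_0} ,$$

where the second expression makes sense also for noninteger $n$.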

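As in the capacity problem one now uses an analytical continuation to find for small $n$

$$\lim_{N\to\infty} N^{-1} \ln \langle Z_R^n(\mathbb{D}) \rangle_{\mathbb{D}} = O(n^2) + n \min_q g(R,q) , \tag{3.7}$$

with

$$g(R,q) = \frac{1}{2} \left( \frac{q - R^2}{1-q} + \ln(1-q) \right) + \alpha \left\langle \ln \left\langle F\!\left( z_{-1} (R z_{-1} + \sqrt{q - R^2}\, z_0 + \sqrt{1-q}\, z_1) \right) \right\rangle_{z_1} \right\rangle_{z_{-1}, z_0} .$$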



¹ Note that

$$\begin{pmatrix} a & b \\ c & d \end{pmatrix} = \begin{pmatrix} a - b d^{-1} c & b \\ 0 & d \end{pmatrix} \begin{pmatrix} 1 & 0 \\ d^{-1} c & 1 \end{pmatrix} .$$

From this equation, it is trivial to obtain the determinant, since each of the two factors is block triangular.

Specializing to the case that $F(x) = \Theta(x)$ allows the minor simplification of rewriting
the $z_1$-average in terms of $H(x)$. Then the analysis of (3.7) shows that the generalization
error decays to zero as $1/\alpha$ in the worst case as well as in the expected case
and that only the pre-factor differs in the two cases [1, 11]. However, a much larger
difference between the two scenarios has been found for some multilayer networks [?].

That the teacher is a perceptron is a rather unrealistic assumption. In a more
general case it may be impossible to find a network which has zero training error. A
reasonable strategy is then to look for a network with minimal training error.

A simple model of such a situation is that student and teacher are perceptrons but
the output of the teacher is corrupted by noise. The cases of additive and multiplicative
noise have been widely analyzed [11, 19, 18, 32]. In the first case $\tau = \mathrm{sgn}(B^T \xi + \eta)$ where
the noise term $\eta$ is independent of $\xi$ and typically assumed $\mathcal{N}(0, v)$. For multiplicative
noise $\tau = \mathrm{sgn}(B^T \xi) \eta$, where $\eta$ is $\pm 1$ and again independent of $\xi$. In this case one will
reasonably assume that the mean of $\eta$ is positive so that $\tau$ equals the uncorrupted
output $\mathrm{sgn}(B^T \xi)$ with a probability greater than 1/2.

It is easy to apply the above analysis to the noisy cases since the generalization error
of a perceptron $\sigma_J$ is still just a function of the overlap $R = B^T J$. To consider the Gibbs
density $p(\sigma)$ given by Eq. (2.2) we use $F(x) = e^{-\beta \Theta(-x)}$ in the definition of $Z_R(\mathbb{D})$.
The partition function then is $Z(\mathbb{D}) = \int dR \, Z_R(\mathbb{D})$. Now the probability that the
weight vector of a student $\sigma_J$ drawn from $p$ has an overlap $R = B^T J$ with the teacher is
$Z_R(\mathbb{D})/Z(\mathbb{D})$, and the most probable value $R^*$ of $R$ is obtained by maximizing $Z_R(\mathbb{D})$.
Since $N^{-1} \ln Z_R(\mathbb{D})$ and hence $N^{-1} \ln Z(\mathbb{D})$ are selfaveraging, in the thermodynamic
limit $R^*$ is again obtained by maximizing $\langle N^{-1} \ln Z_R(\mathbb{D}) \rangle_{\mathbb{D}}$.

Major complications arise from the fact that the replica symmetric assumption is
invalid for sufficiently high $\beta$ if no student with zero training error exists [11]. This
problem can probably be avoided within the framework of a Bayesian analysis, which
yields that in the presence of noise the training error should not be minimized when
aiming for good generalization. Instead, in the case of multiplicative noise, one uses
a carefully chosen finite value of the inverse temperature. Unfortunately this value
depends not only on $\alpha$ but also on the noise level [11]. Since the Bayesian strategy
involves many assumptions about what is being learned, one may wish to stick with
the suboptimal but generally applicable strategy of minimizing $\epsilon_{\mathbb{D}}$, and I shall shortly
describe techniques for dealing with the broken replica symmetry.



Chapter 4 

Multilayer perceptrons 



As mentioned in the introduction a general two layer network is given by

$$\sigma_{J,w}(\xi) = g\!\left( \sum_{k=1}^K w_k h(\xi^T J_k) \right)$$

and such networks have found many applications both in regression and classification
problems. In statistical physics it has only been possible to analyze these networks in
the limit where the number of input dimensions $N$ is much larger than the number of
hidden units $K$. In this limit one will not expect the few adaptable hidden to output
couplings $w_k$ to play a major role. Hence one considers so-called committee machines
where the $w_k$ are constant and equal to 1. (Sometimes, for the sake of normalization, one
assumes $w_k = 1/\sqrt{K}$ instead.)

Formally, the analysis of regression and classification is very similar, and for brevity
I shall consider only classification here; results for regression can be found in the papers
7 and 8. For classification the output function is $g(x) = \mathrm{sgn}(x)$ and I shall also assume
that $h$ is the sign function. So in the sequel the term committee machine (CM) refers
to the class of functions

$$\sigma_J(\xi) = \mathrm{sgn}\!\left( \sum_{k=1}^K \mathrm{sgn}(\xi^T J_k) \right) . \tag{4.1}$$

Note that, as for a real committee, the output is decided by the majority vote of the
$K$ hidden units, and we shall assume that $K$ is odd to avoid a draw. Sometimes it is
convenient to consider a simplified architecture, the so-called tree committee machine
(TCM). For the tree the input $\xi$ is $NK$ dimensional, composed of $K$ vectors $\xi_k \in \mathbb{R}^N$,
and

$$\sigma_J(\xi) = \mathrm{sgn}\!\left( \sum_{k=1}^K \mathrm{sgn}(\xi_k^T J_k) \right) . \tag{4.2}$$

This is simpler because the fields $\xi_k^T J_k$ are now statistically independent if all input
components are independent.¹

For the committee machine (CM) it is interesting to consider the effect of correlations
between the fields $\xi^T J_k$. Let us assume that $J_k = \rho\, w_0 + \sqrt{1-\rho^2}\, w_k$ with orthonormal
vectors $w_j$ and $\rho > 0$. Then one will expect that quite often $\sigma_J(\xi) = \mathrm{sgn}(\xi^T w_0)$.
Indeed, by the law of large numbers, $\sigma_J(\xi) = \mathrm{sgn}(\xi^T w_0)$ will hold with a probability
approaching 1 in the limit of large $K$ if the input components are i.i.d. $N(0,1)$. So for
any fixed positive value of the correlation $\rho$ the output of the committee becomes identical to
that of a perceptron with weight vector $w_0$ in this limit. Hence in many contexts one
will expect $\rho$ to be small when $K$ is large.
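The law-of-large-numbers argument can be checked with a small simulation. This is only a sketch under stated assumptions: the weight vectors are taken as $J_k = \rho\, w_0 + \sqrt{1-\rho^2}\, w_k$ with orthonormal $w_j$, so that for i.i.d. $N(0,1)$ inputs the projections $\xi^T w_j$ are themselves i.i.d. $N(0,1)$ and can be sampled directly. The function name, sample size and seed are hypothetical choices.

```python
import numpy as np

def agreement_with_perceptron(K, rho, samples=4000, seed=0):
    """Fraction of random inputs on which the CM output equals sgn(xi^T w_0)."""
    rng = np.random.default_rng(seed)
    x0 = rng.standard_normal(samples)        # projections xi^T w_0
    xk = rng.standard_normal((samples, K))   # projections xi^T w_k, k = 1..K
    fields = rho * x0[:, None] + np.sqrt(1 - rho**2) * xk   # fields xi^T J_k
    committee = np.sign(np.sign(fields).sum(axis=1))        # majority vote, K odd
    return np.mean(committee == np.sign(x0))

if __name__ == "__main__":
    for K in (11, 101, 1001):
        print(K, agreement_with_perceptron(K, rho=0.5))
```

The agreement should grow towards 1 as $K$ increases at fixed $\rho$.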

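We now turn to the capacity problem for these architectures, setting

$$Z(\mathbb{D}) = \int dJ \prod_{\mu=1}^{P} F\big(\tau^\mu \sigma_J(\xi^\mu)\big) , \tag{4.3}$$

and as for the perceptron

$$\langle Z^n(\mathbb{D}) \rangle_{\mathbb{D}} = \int dJ\, \Big\langle \prod_{a=1}^{n} F\big(\tau\, \sigma_{J^a}(\xi)\big) \Big\rangle_{(\xi,\tau)}^{P} .$$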

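Now $dJ$ refers to an integration over $Kn$ unit spheres in $\mathbb{R}^N$. Further we define the
$Kn$ by $Kn$ order parameter matrix $Q$ as $Q^{ab}_{kl} = J_k^{aT} J_l^{b}$ for $a, b = 1, \ldots, n$ and
$k, l = 1, \ldots, K$, and a $Kn$-dimensional Gaussian $X(Q)$ with zero mean and covariances

$$\langle X_k^a(Q)\, X_l^b(Q) \rangle = \begin{cases} Q^{ab}_{kl} & \text{CM} \\ \delta_{kl}\, Q^{ab}_{kl} & \text{TCM} . \end{cases}$$

The value of $\sigma_{J^a}(\xi)$ is determined by the values of $\xi^T J_k^a$ for the CM and by $\xi_k^T J_k^a$ for
the TCM. Thus

$$\Big\langle \prod_{a=1}^{n} F\big(\tau\, \sigma_{J^a}(\xi)\big) \Big\rangle_{(\xi,\tau)} = \Big\langle \prod_{a=1}^{n} F\Big(\mathrm{sgn}\Big(\sum_{k=1}^{K} \mathrm{sgn}\big(X_k^a(Q)\big)\Big)\Big) \Big\rangle_{X(Q)} .$$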

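Defining the load parameter as $\alpha = P/(KN)$ and using the same arguments as in the case
of the perceptron yields

$$\lim_{N\to\infty} \frac{\ln \langle Z^n(\mathbb{D}) \rangle_{\mathbb{D}}}{KN} = \max_{Q}\; \frac{1}{2K} \ln\det Q + \alpha \ln \Big\langle \prod_{a=1}^{n} F\Big(\mathrm{sgn}\Big(\sum_{k=1}^{K} \mathrm{sgn}\big(X_k^a(Q)\big)\Big)\Big) \Big\rangle_{X(Q)} . \tag{4.4}$$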



$^1$ In much of the literature the definition of the tree committee assumes $N/K$-dimensional $J_k$ and
$\xi_k$, so that the number of free parameters is $N$ and not $NK$ as in the above definition. But this
difference is immaterial as long as final results are expressed in terms of the ratio of examples to free
parameters.
In principle one could now adopt, e.g., a replica symmetric parameterization of $Q$
and obtain an analytic continuation to small $n$. One then still has a large number of
order parameters and the extremal problem involves a $K$-fold integral which has to be
done numerically. While it would probably be feasible to solve this problem for $K = 3$,
to my knowledge no one has done this.

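To simplify the extremal problem, it is convenient to view the order parameter matrix as a
$K$ by $K$ block matrix indexed by the site indices $k, l$ and consisting of blocks which are
$n$ by $n$ matrices. For the TCM the energy term does not depend on $Q^{ab}_{kl}$ if $k \neq l$ and thus
in this case at the maximum $Q^{ab}_{kl} = 0$ for $k \neq l$.$^2$ For both architectures we now make the site
symmetric assumption that at the extremum

$$Q^{ab}_{kl} = \delta_{kl}\, Q^{ab} + P^{ab}/K , \tag{4.5}$$

with $n$ by $n$ matrices $Q$ and $P$. Writing $\mathcal{Q}$ for the full $Kn$ by $Kn$ matrix with elements
$Q^{ab}_{kl}$, this can be more concisely written in block form: $\mathcal{Q} = M_K(Q + P/K,\, P/K)$. As just
noted, $P = 0$ for the TCM.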

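Now the determinant of the full matrix $\mathcal{Q} = M_K(Q + P/K,\, P/K)$ can be evaluated in a way
which is analogous to the calculation for $M_n(u, v)$. Let $U, V$ be $n$ by $n$ matrices and
$x \in \mathbb{R}^n$; then

$$M_K(U,V) \begin{pmatrix} x \\ x \\ \vdots \\ x \end{pmatrix} = \begin{pmatrix} (U+(K-1)V)\,x \\ (U+(K-1)V)\,x \\ \vdots \\ (U+(K-1)V)\,x \end{pmatrix}$$

as well as

$$M_K(U,V) \begin{pmatrix} x \\ -x \\ 0 \\ \vdots \\ 0 \end{pmatrix} = \begin{pmatrix} (U-V)\,x \\ -(U-V)\,x \\ 0 \\ \vdots \\ 0 \end{pmatrix} ,$$

where each column vector consists of $K$ block rows.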

The last equation stays valid if the rows of the argument vector and the resulting
vector are permuted. We thus obtain a decomposition of $\mathbb{R}^{Kn}$ into a direct sum of $K$
$n$-dimensional invariant subspaces of $M_K(U,V)$, and the determinant of $M_K(U,V)$ is just the
product of the determinants on these subspaces:

$$\det M_K(U,V) = \det\big(U + (K-1)V\big)\, \det(U - V)^{K-1} , \tag{4.6}$$

and in particular, for the full matrix $\mathcal{Q} = M_K(Q + P/K,\, P/K)$,

$$\det \mathcal{Q} = \det(Q + P)\, (\det Q)^{K-1} . \tag{4.7}$$
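The determinant formula for $M_K(U,V)$ is easy to check numerically. The sketch below builds the $K$ by $K$ block matrix with $U$ on the diagonal and $V$ elsewhere from Kronecker products and compares both sides of $\det M_K(U,V) = \det(U+(K-1)V)\det(U-V)^{K-1}$ for randomly chosen symmetric test matrices. The helper names and the test matrices are illustrative assumptions.

```python
import numpy as np

def block_matrix(U, V, K):
    """M_K(U, V): K x K block matrix with U on the diagonal and V elsewhere."""
    return np.kron(np.eye(K), U - V) + np.kron(np.ones((K, K)), V)

def check_block_determinant(n=3, K=4, seed=0):
    """Return both sides of the determinant identity for random symmetric U, V."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, n))
    B = rng.standard_normal((n, n))
    U = A @ A.T + n * np.eye(n)   # arbitrary symmetric test matrix
    V = 0.1 * (B + B.T)           # arbitrary symmetric test matrix
    lhs = np.linalg.det(block_matrix(U, V, K))
    rhs = np.linalg.det(U + (K - 1) * V) * np.linalg.det(U - V) ** (K - 1)
    return lhs, rhs
```

Setting `U = Q + P / K` and `V = P / K` gives the special case $\det(Q+P)(\det Q)^{K-1}$.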



$^2$ It suffices to show this for a symmetric 2 by 2 block matrix
$U = \begin{pmatrix} a & b \\ b^T & c \end{pmatrix}$; the extension to general $K$
is then by induction. Using (3.6) one has
$\det \begin{pmatrix} a & b \\ b^T & c \end{pmatrix} = \det \begin{pmatrix} a & -b \\ -b^T & c \end{pmatrix}$.
So over the positive definite matrices,
$0 = \arg\max_b \det \begin{pmatrix} a & b \\ b^T & c \end{pmatrix}$, since $\ln\det U$ is a concave function on these matrices. The concavity can be
shown by rewriting $(\det(\lambda U + (1-\lambda)V))^{-1/2}$ as a Gaussian integral and applying Hölder's inequality.
Equations (4.4, 4.7) form the basis for the following discussion of committee machines.
I shall first consider a limiting scenario in which for large $K$ the summation
over hidden units is exploited to simplify the energy term using the central limit theorem.
This leads to a very simple Gaussian theory which is formally similar to the one
for the perceptron. The main result is that the storage capacity diverges with $K$, but
the theory only yields limited insight into the rate of divergence. But this approach
is highly suited to the analysis of learning problems where the target outputs are not
random but given by a rule (see, e.g., [23], [25], [24], [22], [28]). As in the case of
the perceptron, adapting the capacity calculations to a learning problem is relatively
straightforward and I shall not dwell on this. Instead, in the next chapter, I shall consider
the more precise capacity calculation obtained by taking the $n \to 0$ limit for fixed
$K$ and the interpretation of the replica symmetry breaking found in this calculation in
terms of the internal representations of the committee machine.



4.1 Gaussian theory of committee machines 

The main idea here is to simplify the energy term by arguing that the distribution of

$$Y^a = K^{-1/2} \sum_{k=1}^{K} \mathrm{sgn}\big(X_k^a(Q)\big)$$

becomes Gaussian for large $K$. For the TCM this is just stating the multidimensional
central limit theorem since $X_k^a(Q)$ and $X_l^b(Q)$ are independent if $k \neq l$. This does
not hold for the CM, but assuming the site symmetric parameterization it has been
shown in [?] that the limiting joint distribution of the $Y^a$ is Gaussian. Nevertheless,
to reduce clutter, I shall only consider the TCM in this section.

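Obviously the mean of $Y^a$ is zero and for the covariances one has

$$\langle Y^a Y^b \rangle = K^{-1} \sum_{k=1}^{K} \big\langle \mathrm{sgn}(X_k^a(Q))\, \mathrm{sgn}(X_k^b(Q)) \big\rangle = K^{-1} \sum_{k=1}^{K} \frac{2}{\pi} \arcsin(Q^{ab}_{kk}) = \frac{2}{\pi} \arcsin(Q^{ab}) .$$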



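I have assumed site symmetry for the last equality. So from (4.4, 4.5, 4.7) we obtain

$$\lim_{K,N\to\infty} \frac{\ln \langle Z^n(\mathbb{D}) \rangle_{\mathbb{D}}}{KN} = \max_{Q}\; \frac{1}{2} \ln\det Q + \alpha \ln \Big\langle \prod_{a=1}^{n} F\big(\mathrm{sgn}(Y^a(Q^c))\big) \Big\rangle_{Y(Q^c)} , \tag{4.8}$$

where $Q$ now denotes the $n$ by $n$ matrix of (4.5) and $Y(Q^c)$ is an $n$-dimensional Gaussian with zero mean and

$$\langle Y^a(Q^c)\, Y^b(Q^c) \rangle = (Q^c)^{ab} = \frac{2}{\pi} \arcsin(Q^{ab}) .$$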
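The effective correlations rest on the identity $\langle \mathrm{sgn}(x)\,\mathrm{sgn}(y) \rangle = \frac{2}{\pi}\arcsin q$ for zero mean, unit variance Gaussians with $\langle xy \rangle = q$. A quick Monte Carlo check, given as a sketch with an arbitrary sample size and seed:

```python
import numpy as np

def sign_correlation(q, samples=200_000, seed=0):
    """Monte Carlo estimate of <sgn(x) sgn(y)> for unit-variance Gaussians with <x y> = q."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(samples)
    y = q * x + np.sqrt(1 - q**2) * rng.standard_normal(samples)
    return np.mean(np.sign(x) * np.sign(y))

if __name__ == "__main__":
    for q in (0.2, 0.7, 0.95):
        print(q, sign_correlation(q), 2 / np.pi * np.arcsin(q))
```

The estimate and the arcsine expression should agree up to sampling noise of order `samples ** -0.5`.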



The essential difference to the corresponding expression for the perceptron (1.7) is that
in the energy term the correlation matrix $Q$ is replaced by the effective correlations $Q^c$.



Adopting a replica symmetric parameterization of $Q$, it is now straightforward to
take the limit of small $n$ and for $F = \Theta$ one finds

$$\lim_{K,N\to\infty} \frac{\langle \ln Z(\mathbb{D}) \rangle_{\mathbb{D}}}{KN} = \min_{q}\; \frac{q}{2(1-q)} + \frac{1}{2}\ln(1-q) + \alpha \Big\langle \ln H\Big( -z\, \sqrt{q^c} \big/ \sqrt{1-q^c} \Big) \Big\rangle_{z} , \tag{4.9}$$

where again the only difference to the corresponding expression for the perceptron
(1.11) is the substitution in the energy term of $q$ by $q^c = \frac{2}{\pi}\arcsin q$. This, however,
has a drastic effect on the capacity, since the derivative of $\arcsin q$ is singular at $q = 1$.
As a consequence the energy term, $\langle \ln H(-z\sqrt{q^c}/\sqrt{1-q^c}) \rangle_z$, diverges as $1/\sqrt{1-q}$
in the limit $q \to 1$ instead of the $1/(1-q)$ divergence found for the perceptron.
Now the divergence of the entropy term in (4.9) for $q \to 1$ is no longer balanced by the
divergence of the energy term. Hence the minimization problem has a solution for all
values of $\alpha$ and in particular $1 - q$ scales as $1/\alpha^2$ for large $\alpha$.
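The square-root singularity behind this argument can be made explicit: since $1 - \frac{2}{\pi}\arcsin q = \frac{2}{\pi}\arccos q \approx \frac{2\sqrt{2}}{\pi}\sqrt{1-q}$ for $q \to 1$, the effective gap $1 - q^c$ closes much more slowly than $1 - q$. The short sketch below checks this expansion numerically; the function names are illustrative.

```python
import math

def effective_overlap(q):
    """q^c = (2/pi) arcsin(q)."""
    return 2.0 / math.pi * math.asin(q)

def gap_ratio(eps):
    """(1 - q^c) / sqrt(1 - q) evaluated at q = 1 - eps."""
    return (1.0 - effective_overlap(1.0 - eps)) / math.sqrt(eps)

if __name__ == "__main__":
    for eps in (1e-2, 1e-4, 1e-6):
        print(eps, gap_ratio(eps), 2 * math.sqrt(2) / math.pi)
```

With $1 - q^c \propto \sqrt{1-q}$, balancing an entropy term of order $1/(1-q)$ against an energy term of order $\alpha/\sqrt{1-q}$ gives $1 - q \propto 1/\alpha^2$.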

So we have found the important result that the storage capacity of the TCM diverges 
with the number of hidden units K, and in this sense the multilayer perceptron is more 
powerful than the sum of its parts. Unfortunately the calculation yields no information 
on how quickly the capacity increases with K . 

To gain some insight into this question let us consider the accuracy of the Gaussian
approximation leading to the above result. Going back to Eq. (4.4) we set
$S_k = (\mathrm{sgn}\,X_k^1(Q), \mathrm{sgn}\,X_k^2(Q), \ldots, \mathrm{sgn}\,X_k^n(Q))^T$. Then one can show that
$G = K^{-1/2} \sum_{k=1}^{K} S_k$ converges to a Gaussian by calculating the characteristic function
$\langle e^{iV^TG} \rangle$, where $V \in \mathbb{R}^n$. This yields

$$\big\langle e^{iV^TG} \big\rangle_{\{S_k\}} = \big\langle e^{iV^TS_1/\sqrt{K}} \big\rangle_{S_1}^{K} = \Big( 1 - \frac{\langle (V^TS_1)^2 \rangle_{S_1}}{2K} + O\Big( \frac{\langle (V^TS_1)^4 \rangle_{S_1}}{K^2} \Big) \Big)^{K} \tag{4.10}$$

as the odd terms in the expansion vanish because $S_1$ and $-S_1$ have the same distribution.
One then argues that the higher order term can be neglected for large $K$, and
this yields that the characteristic function converges to

$$e^{-\frac{1}{2} \langle (V^TS_1)^2 \rangle_{S_1}} = e^{-\frac{1}{2} V^T \langle S_1 S_1^T \rangle V} = e^{-\frac{1}{2} V^T Q^c V}$$

and this, being Gaussian, is the characteristic function of a Gaussian. But now assume
that the matrix $Q^c$ is close to singular, let $\lambda$ be its smallest eigenvalue and $V$ an
eigenvector to $\lambda$. Then in the second term of the expansion (4.10) the average $\langle (V^TS_1)^2 \rangle$
is on the order of $\lambda$ and quite small, and the quadratic term only gives the leading
correction if $\lambda$ is large compared to $1/K$, that is if $\lambda K \gg 1$. Otherwise, it only makes sense
to truncate the expansion after the constant term, in essence equating $\lambda$ with zero, or
to take the terms of higher than quadratic order into account as well.
In the replica symmetric theory the smallest eigenvalue of $Q^c$ is $1 - q^c$, which
approaches zero with increasing $\alpha$. It is impossible to equate $1 - q^c$ with zero, since this
leads to a divergence of the energy term in Eq. (4.9). So, since the Gaussian approximation
ignores the higher than quadratic terms, it can only be trusted if $(1 - q^c)K$ is
large, and the scaling of $q$ with $\alpha$ yields that this requires $\alpha \ll \sqrt{K}$.

On the other hand, if $\alpha \ll \sqrt{K}$, the Gaussian approximation is reliable and thus the
theory predicts that the capacity of the TCM is at least $\sqrt{K}$. Indeed, using a replica
symmetric parameterization of Eq. (4.4) and taking the $n \to 0$ limit before the large $K$
limit has been shown to yield a $\sqrt{K}$ divergence of the capacity [?]. Unfortunately
these results are completely wrong, as already noticed in [?], [?]. In [15] rigorous upper
bounds on the capacity are derived, which show that the storage capacity of the TCM
cannot diverge with $K$ faster than $\log K$.



4.2 Breaking replica symmetry 

It turns out that one does not obtain the correct analytical continuation from integer
$n$ to $n$ close to zero when using a replica symmetric parameterization of $Q$. Such a
phenomenon was first discovered in the quite different context of the infinite range spin
glass [?] and, after much soul searching among the physicists involved, Giorgio Parisi
came up with a hierarchical scheme for relaxing the replica symmetric assumption. I
shall first apply the first level of this scheme to the TCM (one step of replica symmetry
breaking or just RSB1) and then discuss the physical implications of the approach.

The basic idea in RSB1 is to partition the $n$ replicas into $n/m$ groups of equal size
and parameterize $Q$ by setting $Q^{ab}$ equal to $q_1$ if the different replicas $a$ and $b$ belong
to the same group and to $q_0$ else. Formally this amounts to writing $Q$ as the block
matrix

$$Q = M_{n/m}\big( M_m(1, q_1),\, M_m(q_0, q_0) \big) . \tag{4.11}$$

It is simple to calculate the determinant of $Q$ by applying Eq. (4.6) to obtain

$$\det Q = \det M_m\Big( 1 + \frac{n-m}{m}\, q_0,\; q_1 + \frac{n-m}{m}\, q_0 \Big)\; \det M_m(1 - q_0,\, q_1 - q_0)^{\frac{n-m}{m}} .$$
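For integer $n$ and $m$ the RSB1 matrix and its determinant can be checked directly. The sketch below builds $M_{n/m}(M_m(1,q_1), M_m(q_0,q_0))$ from Kronecker products and compares the numerical determinant with a closed form written in terms of the three distinct eigenvalues $1-q_1$, $1-q_1+m(q_1-q_0)$ and $1-q_1+m(q_1-q_0)+nq_0$. This expanded form is my own rewriting of the block-determinant result, and the helper names are hypothetical.

```python
import numpy as np

def M(k, diag, off):
    """k x k block matrix with `diag` on the diagonal and `off` elsewhere
    (blocks may be scalars or square arrays)."""
    diag, off = np.atleast_2d(diag), np.atleast_2d(off)
    return np.kron(np.eye(k), diag - off) + np.kron(np.ones((k, k)), off)

def rsb1_matrix(n, m, q0, q1):
    """RSB1 overlap matrix for n replicas in n/m groups of size m (m must divide n)."""
    return M(n // m, M(m, 1.0, q1), M(m, q0, q0))

def rsb1_det(n, m, q0, q1):
    """Determinant from the eigenvalues; also defined for non-integer n and m."""
    return ((1 - q1) ** (n - n / m)
            * (1 - q1 + m * (q1 - q0)) ** (n / m - 1)
            * (1 - q1 + m * (q1 - q0) + n * q0))
```

Because `rsb1_det` only involves powers of positive numbers, it provides the continuation to non-integer $n$ and $m$ that the replica calculation needs.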

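Our next goal is to decompose the $n$-dimensional Gaussian $Y(Q^c)$ in Eq. (4.8).
Note that $Q^c$ has the same structure as $Q$ with the $q_i$ replaced by $q_i^c = \frac{2}{\pi}\arcsin q_i$.
Because of the partitioning of the replicas, it is convenient to think of the replica index
$a$ as a two dimensional index $a = [u, v]$. We define $[u, v] = (u - 1)m + v$ for $v = 1, \ldots, m$,
and $u = 1, \ldots, n/m$ indexes the different groups of the partition. One can now rewrite
$Y^a = Y^{[u,v]}$ in terms of i.i.d. $N(0,1)$ random variables $z$, $z^u$, $z^{u,v}$ as

$$Y^{[u,v]} = \sqrt{q_0^c}\; z + \sqrt{q_1^c - q_0^c}\; z^u + \sqrt{1 - q_1^c}\; z^{u,v} .$$

So for the $Y(Q^c)$ average in the energy term of (4.8) we obtain

$$\Big\langle \prod_{a=1}^{n} F\big(\mathrm{sgn}(Y^a(Q^c))\big) \Big\rangle_{Y(Q^c)} = \Big\langle \prod_{u=1}^{n/m} \prod_{v=1}^{m} F\big(\mathrm{sgn}(Y^{[u,v]})\big) \Big\rangle_{z^{u,v},\, z^u,\, z}$$

$$= \Big\langle \Big\langle \Big\langle F\Big(\mathrm{sgn}\Big( \sqrt{q_0^c}\; z + \sqrt{q_1^c - q_0^c}\; z^1 + \sqrt{1 - q_1^c}\; z^{1,1} \Big)\Big) \Big\rangle_{z^{1,1}}^{m} \Big\rangle_{z^1}^{n/m} \Big\rangle_{z} ,$$

where the last expression makes sense for noninteger $n$ and $m$. So, using this continuation
to small $n$ and assuming that $F = \Theta$, from Eq. (4.8) we obtain within the RSB1
Ansatz

$$\lim_{K,N\to\infty} \frac{\langle \ln Z(\mathbb{D}) \rangle_{\mathbb{D}}}{KN} = \min_{q_0, q_1, m}\; G_s(q_0, q_1, m) + G_r(q_0, q_1, m) , \tag{4.12}$$

where

$$G_s = \frac{q_0}{2\big(1 - q_1 + m(q_1 - q_0)\big)} + \frac{m-1}{2m} \ln(1 - q_1) + \frac{1}{2m} \ln\big(1 - q_1 + m(q_1 - q_0)\big) ,$$

$$G_r = \frac{\alpha}{m} \Big\langle \ln \Big\langle H\Big( - \frac{\sqrt{q_0^c}\; z + \sqrt{q_1^c - q_0^c}\; z^1}{\sqrt{1 - q_1^c}} \Big)^{m} \Big\rangle_{z^1} \Big\rangle_{z} .$$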

Let us first discuss the ways in which the RSB1 parameterization reduces to the
replica symmetric one. The case $q_1 = q_0 = q$ is obvious; then $G_s + G_r$ no longer
depends on $m$ and is the same as (4.9). But also for $m = 1$ one finds that $G_s + G_r$
is now independent of $q_1$ and equivalent to the replica symmetric expression with $q_0$
playing the role of $q$. While one cannot set $m = 0$, a little algebra shows that in the
limit $m \to 0$ the value of $G_s + G_r$ becomes independent of $q_0$, yielding equivalence to
the replica symmetric case with $q_1$ playing the role of $q$.



To solve Eq. (4.12), it helps to first consider the simpler problem of minimizing

$$F(q_0, q_1, m) = G_s(q_0, q_1, m) + G_r(q_0, q_1, m) - \frac{m-1}{2m} \ln(1 - q_1) - \frac{1}{2m} \ln m .$$

Setting $q_1 = 1$, one immediately sees that $mF(q, 1, m)$ is independent of $m$ and equal to the
replica symmetric functional (4.9). Since for any $\alpha > 0$ the latter can be made negative
by an appropriate choice of $q$, for this choice of $q$ the value of $F(q, 1, m)$ diverges to
$-\infty$ for $m \to +0$. So minimizing $F(q_0, q_1, m)$ is easy.

Due to the divergent $\ln(1 - q_1)$ term in $G_s$, one cannot set $q_1 = 1$ in the function
(4.12) we actually want to minimize. But since the divergence is only logarithmic, one
will expect the optimal values of $1 - q_1$ and $m$ to be close to 0. Using the asymptotic
expansion of $H(x)$ for large arguments makes it possible to simplify the energy term
for $q_1 \to 1$. In the end one finds that the minimization problem has a finite solution for
all values of $\alpha$ and that $q_0 \to q_1 \to 1$ and $m \to 0$ with increasing $\alpha$. The asymptotic
scalings are

$$1 - q_0 \propto \alpha^{-2} \quad \text{and} \quad \ln(1 - q_1) \propto \ln m \propto -\alpha^2 . \tag{4.13}$$

In spite of the fact that $q_0 \to q_1$ as well as $m \to 0$ are replica symmetric limits, the
prediction for the typical volume is completely different. A super-exponential decay
is found in RSB1, in contrast to the exponential decay of the volume in the replica
symmetric theory.

Most importantly, we obtain a completely different result for the validity of the
Gaussian approximation. The smallest eigenvalue $\lambda$ of $Q^c$ is now $1 - q_1^c$. So in view
of (4.13) the $\lambda K \gg 1$ criterion for trusting the central limit theorem translates into
$\alpha \ll \sqrt{\ln K}$, and the Gaussian theory now predicts that the critical capacity is at least
on the order of $\sqrt{\ln K}$, which is entirely compatible with the rigorous $\ln K$ upper bound.



4.3 The physical meaning of RSB 

To lighten the notation, I shall discuss the interpretation of RSB in the context of
perceptron learning. Since for the perceptron replica symmetry is broken only beyond
capacity, the discussion is based on the Gibbs density $p(\sigma_J)$ defined by Eq. (2.2). One
can then consider the probability density $P_{\mathbb{D}}(q)$ that the weight vectors $J^1$ and $J^2$ of
two perceptrons drawn from the Gibbs density have an overlap $q$. The cumulative
distribution function of $P_{\mathbb{D}}(q)$ is

$$C_{\mathbb{D}}(q) = \int_{-1}^{q} dx\, P_{\mathbb{D}}(x) = \int dJ^1 dJ^2\, \Theta(q - J^{1T}J^2)\, p(\sigma_{J^1})\, p(\sigma_{J^2})$$

$$= Z(\mathbb{D})^{-2} \int dJ^1 dJ^2\, \Theta(q - J^{1T}J^2) \prod_{a=1}^{2} \prod_{\mu=1}^{P} F(\tau^\mu J^{aT} \xi^\mu) .$$

We want to calculate the training set average of $C_{\mathbb{D}}(q)$ for large $N$ and to this end
consider the related quantity

$$C_\epsilon(q, N, n) = \Big\langle Z(\mathbb{D})^{n-2} \int dJ^1 dJ^2\, \Theta_\epsilon(q - J^{1T}J^2) \prod_{a=1}^{2} \prod_{\mu=1}^{P} F(\tau^\mu J^{aT} \xi^\mu) \Big\rangle_{\mathbb{D}} ,$$

where $\Theta_\epsilon(x) = \epsilon + \Theta(x)$. Setting $\epsilon = 0$, for the object of interest to us we have

$$\langle C_{\mathbb{D}}(q) \rangle_{\mathbb{D}} = C_0(q, N, 0) . \tag{4.14}$$

But in the sequel we assume that $\epsilon$ is positive, taking the limit $\epsilon \to 0$ in the end.



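For integer $n \geq 2$ one has

$$C_\epsilon(q, N, n) = \Big\langle \int dJ\, \Theta_\epsilon(q - J^{1T}J^2) \prod_{a=1}^{n} \prod_{\mu=1}^{P} F(\tau^\mu J^{aT} \xi^\mu) \Big\rangle_{\mathbb{D}}$$

and, using replicas, we evaluate the RHS for general $n$. To get rid of the special
treatment of the first two indices, we introduce the function

$$G_\epsilon(q, Q) = \frac{1}{n(n-1)} \sum_{a \neq b} \Theta_\epsilon(q - Q^{ab})$$

and by symmetry

$$C_\epsilon(q, N, n) = \Big\langle \int dJ\, G_\epsilon(q, J^TJ) \prod_{a=1}^{n} \prod_{\mu=1}^{P} F(\tau^\mu J^{aT} \xi^\mu) \Big\rangle_{\mathbb{D}} .$$

Transforming the integral to the order parameter matrix $Q$ yields

$$C_\epsilon(q, N, n) = D_n(1) \int dQ\, G_\epsilon(q, Q) \Big( (\det Q)^{\frac{N-n-1}{2N}} \Big\langle \prod_{a=1}^{n} F(X^a(Q)) \Big\rangle_{X(Q)}^{\alpha} \Big)^{N} .$$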

The integral will decay to zero with increasing $N$, and the asymptotic rate of decay is to
leading order found by Laplace's method [?]. This shows that the decay is determined
by the properties of the integrand in arbitrarily small neighborhoods of its maxima. In
fact, since $\epsilon$ is positive, $G_\epsilon(q, Q)$ does not vanish, and we need only the neighborhood
of a maximum $Q^*(n)$ of $(\det Q)^{1/2} \langle \prod_{a=1}^{n} F(X^a(Q)) \rangle_{X(Q)}^{\alpha}$. This maximum is unique up
to permutations of the replica indices. Further $G_\epsilon(q, Q)$ is piecewise constant, and in
particular $G_\epsilon(q, Q) = G_\epsilon(q, Q^*(n))$ holds in a neighborhood of the maximum, except if
$q$ is equal to an off-diagonal element of $Q^*(n)$. So for a generic value of $q$ we can treat
$G_\epsilon(q, Q)$ as a factor constant in $Q$ and find

$$\lim_{N\to\infty} \frac{C_\epsilon(q, N, n)}{C_\epsilon(1, N, n)} = \frac{G_\epsilon(q, Q^*(n))}{G_\epsilon(1, Q^*(n))} .$$

Thus in any parameterization of $Q^*(n)$ which enables us to take $n$ to zero, considering
the limit $\epsilon \to 0$, we obtain for the cumulative distribution of $q$

$$\lim_{N\to\infty} \langle C_{\mathbb{D}}(q) \rangle_{\mathbb{D}} = \lim_{n\to 0} G_0(q, Q^*(n)) .$$

We have used (4.14) and the fact that $G_0(1, Q^*(n)) = 1$.

In replica symmetry this yields the simple result that $\langle C_{\mathbb{D}}(q) \rangle_{\mathbb{D}}$ approaches the step
function $\Theta(q - q^*)$ for large $N$, where $q^*$ is the stationary value of the order parameter
for small $n$. Further, since $\langle C_{\mathbb{D}}(q) \rangle_{\mathbb{D}}$ converges to a step function with a single step,
$C_{\mathbb{D}}(q)$ is selfaveraging.
If $Q$ is the RSB1 matrix (4.11), one has
$G_0(q, Q) = \frac{n-m}{n-1}\, \Theta(q - q_0) + \frac{m-1}{n-1}\, \Theta(q - q_1)$. So
setting $n$ to zero we obtain as the physical interpretation of RSB1 that the cumulative
distribution of $q$ has two steps:

$$\lim_{N\to\infty} \langle C_{\mathbb{D}}(q) \rangle_{\mathbb{D}} = m\, \Theta(q - q_0) + (1 - m)\, \Theta(q - q_1) .$$

However, when replica symmetry is broken, $C_{\mathbb{D}}(q)$ is no longer selfaveraging [14], [?].
So, even for large $N$, not all properties of the single system can be deduced from those
of the training set average in this case.
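The two step heights can be traced back to simple counting. For integer $n$ and $m$, a fraction $(n-m)/(n-1)$ of the ordered replica pairs $a \neq b$ lies in different groups (overlap $q_0$) and a fraction $(m-1)/(n-1)$ in the same group (overlap $q_1$); continuing these expressions to $n = 0$ gives the weights $m$ and $1-m$. The sketch below checks the counting by brute force, assuming replicas are grouped as consecutive blocks of size $m$. The function names are illustrative.

```python
from itertools import permutations

def counted_weights(n, m):
    """Fractions of ordered pairs a != b in (different groups, same group), by enumeration."""
    different = same = 0
    for a, b in permutations(range(n), 2):
        if a // m == b // m:
            same += 1
        else:
            different += 1
    total = n * (n - 1)
    return different / total, same / total

def continued_weights(n, m):
    """Closed forms (n - m)/(n - 1) and (m - 1)/(n - 1), usable for non-integer n and m."""
    return (n - m) / (n - 1), (m - 1) / (n - 1)
```

For example, `continued_weights(0, 0.3)` evaluates the weights at $n = 0$, $m = 0.3$, where both are positive and sum to one.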



4.4 Beyond RSB1?

For the TCM discussed in Section 4.2 the order parameter $q$ refers to the overlap between
the weight vectors of the same hidden unit of two TCMs in version space. So the
RSB1 solution means that the pattern averaged density of this overlap converges to
two $\delta$-peaks. Why not three?

Indeed, it is straightforward to allow for three peaks by parameterizing $Q$ as

$$Q = M_{n/m_1}\big( R_1(m_1, m_2; 1, q_2, q_1),\, R_1(m_1, m_2; q_0, q_0, q_0) \big) , \tag{4.15}$$

where $R_1$ denotes an RSB1 matrix, $R_1(n, m; a, b, c) = M_{n/m}(M_m(a, b), M_m(c, c))$. Of
course, one might still think that this RSB2 Ansatz is not general enough, and recursively
continue to construct an RSB-$k$ parameterization allowing for $k + 1$ peaks.

Using the techniques discussed for RSB1, it is straightforward, if somewhat tedious,
to write down the minimization problem for the typical volume using, say, an RSB2
parameterization. And I would expect that for sufficiently large but finite $\alpha$ a higher
order RSB parameterization does improve on RSB1. But it is not clear that this will
affect the capacity result, because the higher order solutions can converge to the RSB1
solution with increasing $\alpha$. One will in fact expect the $q$'s to converge to 1, that
is toward a single peak, and already in the RSB1 parameterization rather extreme
scalings are needed to construct a non replica symmetric solution. Unfortunately, it
would be extremely complicated to show analytically that any RSB2 solution must
be degenerate with the RSB1 solution (4.13) in the large $\alpha$ limit. Indeed I have not
even proven that the solution (4.13) is the unique global minimum in RSB1 space. But
it would perhaps be worthwhile to numerically track a higher order RSB solution to
large values of $\alpha$.
4.5 Storage capacity of the CM

We now obtain an accurate value for the capacity of the CM by taking the $n \to 0$
limit for finite $K$. Going back to Eq. (4.4) and using the site symmetry assumption
(4.5) we consider the RSB1 theory. Since we are dealing with the fully connected
architecture, the matrices $Q$ and $P$ are parameterized as:

$$Q = M_{n/m}\big( M_m(1 - p_2/K,\, q_1),\, M_m(q_0, q_0) \big) , \quad P = M_{n/m}\big( M_m(p_2, p_1),\, M_m(p_0, p_0) \big) . \tag{4.16}$$



To decompose the Gaussians $X_k^a(Q)$ in Eq. (4.4), we rewrite $a$ in form of the two
dimensional index $[u, v]$ as in Section 4.2, employ the i.i.d. $N(0,1)$ Gaussians $z_k$, $z_k^u$, $z_k^{u,v}$
and set:

$$X_k^{[u,v]} = u\, z_k + v\, z_k^{u} + w\, z_k^{u,v} + \hat u\, \bar z + \hat v\, \bar z^{u} + \hat w\, \bar z^{u,v} . \tag{4.17}$$

Here the parameters ($u$, $v$ etc., not to be confused with the replica indices $u, v$ appearing
as superscripts) have to be chosen so that $\mathcal{Q} = M_K(Q + P/K,\, P/K)$
holds for the full covariance matrix $\mathcal{Q}$ of the $X_k^a$. The last three summands in (4.17)
are needed because the sites can be correlated ($P \neq 0$). In keeping with our usual
style of decomposing Gaussians one might expect $\bar z$, $\bar z^u$, $\bar z^{u,v}$ to be $N(0,1)$ Gaussians
independent of each other and the other random variables. However, it turns out
that in the relevant regime the sites are anti-correlated, $\langle X_k^a X_l^b \rangle < 0$ for $k \neq l$. If
all random variables in the decomposition are independent, such anti-correlations are
only possible if, say, $\hat u$ is imaginary. This is probably not really a problem because
the averages in the energy term lead to $H$-functions which do make sense for complex
arguments. However, I find this too murky, $X_k^{[u,v]}$ is after all a real valued Gaussian,
and thus adopt a different definition of $\bar z$, $\bar z^u$, $\bar z^{u,v}$, setting:

$$\bar z = K^{-1} \sum_{k=1}^{K} z_k , \quad \bar z^{u} = K^{-1} \sum_{k=1}^{K} z_k^{u} , \quad \bar z^{u,v} = K^{-1} \sum_{k=1}^{K} z_k^{u,v} .$$

Then a simple calculation shows that the $X_k^a$ have the desired covariances if the parameters
satisfy

$$\begin{aligned} u^2 &= q_0 , & (u + \hat u)^2 &= p_0 + q_0 , \\ v^2 &= q_1 - q_0 , & (v + \hat v)^2 &= q_1 + p_1 - q_0 - p_0 , \\ w^2 &= 1 - p_2/K - q_1 , & (w + \hat w)^2 &= 1 - p_2/K + p_2 - q_1 - p_1 . \end{aligned} \tag{4.18}$$
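The "simple calculation" can be mirrored numerically for one level of the decomposition. Writing a term $c\, z_k + \hat c\, \bar z$ with $\bar z = K^{-1}\sum_k z_k$ as a linear map $X = A z$ of i.i.d. $N(0,1)$ variables, its covariance over the sites is $A A^T$. With $c^2 = q$ and $(c + \hat c)^2 = q + p$ this should equal $q\,\delta_{kl} + p/K$, including negative $p$ with real coefficients. The sketch below assumes the positive roots are taken; the function name is hypothetical.

```python
import numpy as np

def site_covariance(q, p, K):
    """Covariance over sites k of c*z_k + c_hat*zbar, with zbar the site average of z_k,
    c**2 = q and (c + c_hat)**2 = q + p (positive roots assumed, needs q + p >= 0)."""
    c = np.sqrt(q)
    c_hat = np.sqrt(q + p) - c
    A = c * np.eye(K) + (c_hat / K) * np.ones((K, K))   # X = A @ z for i.i.d. N(0,1) z
    return A @ A.T

if __name__ == "__main__":
    # Anti-correlated sites (p < 0) with real coefficients:
    print(site_covariance(q=0.6, p=-0.4, K=5))
```

The same check applies to each of the three levels $(u, \hat u)$, $(v, \hat v)$ and $(w, \hat w)$ with the corresponding right-hand sides of the parameter conditions.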






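Using this decomposition the average in the energy term of Eq. (4.4) can be rewritten
as:

$$\Big\langle \prod_{a=1}^{n} F\Big(\mathrm{sgn}\Big(\sum_{k=1}^{K} \mathrm{sgn}\big(X_k^a(Q)\big)\Big)\Big) \Big\rangle_{X(Q)}$$

$$= \Big\langle \Big\langle \Big\langle F\Big(\mathrm{sgn}\Big(\sum_{k=1}^{K} \mathrm{sgn}\big( u z_k + v z_k^{1} + w z_k^{1,1} + \hat u \bar z + \hat v \bar z^{1} + \hat w \bar z^{1,1} \big)\Big)\Big) \Big\rangle_{\{z_k^{1,1}\}}^{m} \Big\rangle_{\{z_k^{1}\}}^{n/m} \Big\rangle_{\{z_k\}}$$

$$= \Big\langle \Big\langle \Big\langle F\Big(\mathrm{sgn}\Big(\sum_{k=1}^{K} \mathrm{sgn}\big(X_k^{[1,1]}\big)\Big)\Big) \Big\rangle_{\{z_k^{1,1}\}}^{m} \Big\rangle_{\{z_k^{1}\}}^{n/m} \Big\rangle_{\{z_k\}} .$$

To lighten the notation, in the last equation (4.17) is used to write the fields in a more
compact form.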

For the entropy term we need the determinant of the full matrix $\mathcal{Q} = M_K(Q + P/K,\, P/K)$,
which is easily calculated by repeatedly applying (4.6). Specializing to $F = \Theta$, from the small $n$ limit one then
obtains:

$$\lim_{N\to\infty} \frac{\langle \ln Z(\mathbb{D}) \rangle_{\mathbb{D}}}{KN} = \operatorname*{extr}_{\{q_i\},\{p_i\},m}\; \frac{1}{m} \Big( G_s(\{q_i\},\{p_i\},m) + \alpha\, G_r(\{q_i\},\{p_i\},m) \Big) \tag{4.19}$$

where

$$G_s = \frac{K-1}{2K}\, S(u^2, v^2, w^2, m) + \frac{1}{2K}\, S\big( (u + \hat u)^2, (v + \hat v)^2, (w + \hat w)^2, m \big) ,$$

$$S(a, b, c, m) = (m - 1) \ln c + \ln(c + mb) + \frac{ma}{c + mb}$$

and

$$G_r = \Big\langle \ln \Big\langle \Big\langle \Theta\Big( \sum_{k=1}^{K} \mathrm{sgn}\big(X_k^{[1,1]}\big) \Big) \Big\rangle_{\{z_k^{1,1}\}}^{m} \Big\rangle_{\{z_k^{1}\}} \Big\rangle_{\{z_k\}} .$$

The extremum in Eq. (4.19) means that the function has to be minimized w.r.t. all
order parameters except $p_0$. The function has to be maximized w.r.t. $p_0$ since in
physical terms $p_0 = K J_k^T J_l$. So $p_0$ refers to a quantity with a single replica index
and is analogous to the student/teacher overlap in the context of learning a rule in the
following sense: Instead of considering the full space of networks, we could have focused
on sub-shells where $J_l^T J_k$ is constant. The capacity of the full space of networks is given
by the sub-shell of maximal capacity and this is just what is obtained by maximizing
Eq. (4.19) in $p_0$.

It is impractical to calculate $G_r$ in its present form since this involves $3K$ Gaussian
integrals. To do something about this, we first rewrite (4.19) in terms of the internal
representations $\iota$ in the committee, that is in terms of the outputs $\iota_k$ of its hidden
units. We use

$$1 = \mathrm{Tr}_\iota \prod_{k=1}^{K} \Theta\big( \iota_k X_k^{[1,1]} \big) ,$$

where the trace over $\iota$ denotes a summation over all $\iota \in \{-1, 1\}^K$. Multiplying the
$\Theta$-function in $G_r$ with the above RHS yields:

$$G_r = \Big\langle \ln \Big\langle \Big\langle \mathrm{Tr}_\iota\, \Theta\Big( \sum_{k=1}^{K} \iota_k \Big) \prod_{k=1}^{K} \Theta\big( \iota_k X_k^{[1,1]} \big) \Big\rangle_{\{z_k^{1,1}\}}^{m} \Big\rangle_{\{z_k^{1}\}} \Big\rangle_{\{z_k\}} .$$

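To highlight the dependence of the energy term on $w$ and $\hat w$, we now define:

$$f(\{Y_k\}, \{\iota_k\}) = \Big\langle \prod_{k=1}^{K} \Theta\big( \iota_k ( Y_k + w z_k^{1,1} + \hat w \bar z^{1,1} ) \big) \Big\rangle_{\{z_k^{1,1}\}} \quad \text{and} \quad Y_k = u z_k + \hat u \bar z + v z_k^{1} + \hat v \bar z^{1} , \tag{4.20}$$

so

$$G_r = \Big\langle \ln \Big\langle \Big( \mathrm{Tr}_\iota\, \Theta\Big( \sum_{k=1}^{K} \iota_k \Big) f(\{Y_k\}, \{\iota_k\}) \Big)^{m} \Big\rangle_{\{z_k^{1}\}} \Big\rangle_{\{z_k\}} .$$

As $\alpha$ approaches the critical capacity the volume of admissible networks vanishes and
one will expect that $w, \hat w \to 0$. In this limit the trace is dominated by a single term
and thus

$$\Big( \mathrm{Tr}_\iota\, \Theta\Big( \sum_{k=1}^{K} \iota_k \Big) f(\{Y_k\}, \{\iota_k\}) \Big)^{m} \approx \max_\iota\, \Theta\Big( \sum_{k=1}^{K} \iota_k \Big) f(\{Y_k\}, \{\iota_k\})^{m} .$$

It is too troublesome to locate the maximum as function of the $Y_k$ and hence we replace
the maximization in $\iota$ by a summation over all possible values of $\iota$. So we use that for
$w, \hat w \to 0$

$$\Big( \mathrm{Tr}_\iota\, \Theta\Big( \sum_{k=1}^{K} \iota_k \Big) f(\{Y_k\}, \{\iota_k\}) \Big)^{m} \leq \mathrm{Tr}_\iota\, \Theta\Big( \sum_{k=1}^{K} \iota_k \Big) f(\{Y_k\}, \{\iota_k\})^{m} . \tag{4.21}$$

The simplified energy term

$$\hat G_r = \Big\langle \ln \Big\langle \mathrm{Tr}_\iota\, \Theta\Big( \sum_{k=1}^{K} \iota_k \Big) f(\{Y_k\}, \{\iota_k\})^{m} \Big\rangle_{\{z_k^{1}\}} \Big\rangle_{\{z_k\}}$$

obtained by commuting the trace and the exponentiation with $m$ is an upper bound to
the true value of $G_r$ for $w, \hat w \to 0$, and we consider the extremal problem

$$\operatorname*{extr}_{\{q_i\},\{p_i\},m}\; \frac{1}{m} \Big( G_s(\{q_i\},\{p_i\},m) + \alpha\, \hat G_r(\{q_i\},\{p_i\},m) \Big) \tag{4.22}$$

instead of (4.19).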

While (4.22) is more accessible than the original extremal problem, the remaining
calculations are nevertheless quite involved and they are described in some detail in
[?]. Here, I shall just present the key features of the solution. At a critical value
$\hat\alpha_c(K)$ of $\alpha$ which for large $K$ scales as

$$\hat\alpha_c(K) \sim \frac{16}{\pi - 2} \sqrt{\ln K}$$

the solution of (4.22) diverges. At the critical $\alpha$ the stationary values of $w$ and $\hat w$
vanish. Consequently the extremal value of (4.22) bounds the extremal value of the
original problem (4.19) and $\hat\alpha_c(K)$ is an upper bound to the true capacity $\alpha_c(K)$. An
interesting question is whether this upper bound coincides with the critical capacity to
leading order in $K$. This is related to the question whether the inequality (4.21) is sufficiently
tight at the stationary point. Since the trace over the internal representations $\iota$ on its
LHS is dominated by a single term due to $w, \hat w \to 0$, the inequality would be tight if
also on the RHS the trace were dominated by a single term. This, however, is tricky,
since as $\alpha$ approaches $\hat\alpha_c(K)$ one finds that also $m \to 0$. Consequently as function
of $\iota$ the maximum of $f(\{Y_k\}, \{\iota_k\})^m$ on the RHS is not as pronounced as the one of
just $f(\{Y_k\}, \{\iota_k\})$ on the LHS of the inequality. So the tightness of the inequality is
determined by the ratios of $w$, $\hat w$ and $m$ as they approach zero. These are calculated in
[30] and suggest that while a few terms do contribute to the trace on the RHS, their
number is not very large, and that the critical capacity to leading order coincides with
$\hat\alpha_c(K)$. It would however require quite intricate combinatorics to actually show that
this is the case.

Finally, it is interesting to compare the results for the connected committee to the
ones for the tree architecture. For the TCM entirely analogous calculations [?], [17]
yield the smaller capacity of $\frac{16}{\pi} \sqrt{\ln K}$. This difference is due to the fact that at the
critical capacity the weight vectors of the CM are anti-correlated, $p_0 = -1$. If one were
to artificially restrict the state space of the CM so that the $K$ weight vectors are forced
to be orthogonal, which corresponds to $p_0 = 0$, the capacities of the CM and the TCM
would be the same. The usefulness of anti-correlated hidden units is related to the fact
that in the orthogonal case the output of the CM is quite similar to that of a perceptron.
In particular, if one considers the perceptron with weights $\bar J$ obtained by averaging the
weight vectors $J_k$ of the CM, $\bar J = \sum_{k=1}^{K} J_k$, one finds that when $p_0 = 0$ this perceptron
gives the same output as the CM for approximately 80% of randomly chosen inputs
even when $K$ is large. However, $p_0 = -1$ leads to $\bar J = 0$, the approximating perceptron
is undefined, and no perceptron improves on random guessing in predicting the output
of this CM for large $K$. Since the storage capacity of the perceptron is limited, the
anti-correlated state maximizes the capacity of the committee.
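The figure of roughly 80% for the orthogonal case can be reproduced with a short simulation. This is a sketch under the assumption of orthonormal weight vectors and i.i.d. $N(0,1)$ inputs, so that the fields $\xi^T J_k$ are independent standard Gaussians and can be sampled directly. In the Gaussian limit the expected agreement is $1 - \arccos(\sqrt{2/\pi})/\pi \approx 0.79$. The function name, sample size and seed are illustrative.

```python
import numpy as np

def cm_perceptron_agreement(K, samples=20000, seed=0):
    """Fraction of inputs on which a CM with orthonormal weights agrees with the
    perceptron whose weight vector is the sum of the hidden-unit weight vectors."""
    rng = np.random.default_rng(seed)
    fields = rng.standard_normal((samples, K))          # fields xi^T J_k
    committee = np.sign(np.sign(fields).sum(axis=1))    # majority vote, K odd
    perceptron = np.sign(fields.sum(axis=1))            # sgn(xi^T sum_k J_k)
    return np.mean(committee == perceptron)

if __name__ == "__main__":
    print(cm_perceptron_agreement(101))
```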



4.6 Counting internal representations 

Historically, the capacity of the committee machine was first obtained by counting
the typical number of internal representations of a training set [16], [?] and not by the
RSB1 calculations of the Gardner volume. To round off the analysis of the CM, I shall
describe the close relationship between the two approaches.

Given a training set $\mathbb{D} = \{(\xi^\mu, \tau^\mu)\}$ one can ask whether outputs $\iota_k^\mu \in \{-1, 1\}$ of
the hidden units exist which can be (a) realized by the committee and for which (b)
the output of the committee on $\xi^\mu$ is $\tau^\mu$. This amounts to asking whether the volume
of weights

$$V_\iota(\mathbb{D}) = \int dJ \prod_{\mu=1}^{P} \Theta\Big( \tau^\mu \sum_{k=1}^{K} \iota_k^\mu \Big) \prod_{\mu=1}^{P} \prod_{k=1}^{K} \Theta\big( \iota_k^\mu J_k^T \xi^\mu \big) \tag{4.23}$$

associated with the internal representation $\iota$ is nonzero. There are $2^{(K-1)P}$ internal
representations with the property (b), $\tau^\mu \sum_{k=1}^{K} \iota_k^\mu > 0$, but not all of them will be
realizable by the committee. So the quantity of interest is the typical number of
realizable representations

$$\exp \big\langle \ln \mathrm{Tr}_\iota\, \Theta(V_\iota(\mathbb{D})) \big\rangle_{\mathbb{D}} .$$
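The count $2^{(K-1)P}$ follows from a per-pattern count: for odd $K$, exactly half of the $2^K$ hidden-unit configurations have a majority agreeing with the target. A brute-force check of this per-pattern factor, with an illustrative function name:

```python
from itertools import product

def count_consistent_representations(K, tau=1):
    """Number of hidden-unit configurations i in {-1, 1}^K with tau * sum(i) > 0."""
    return sum(1 for i in product((-1, 1), repeat=K) if tau * sum(i) > 0)

if __name__ == "__main__":
    for K in (1, 3, 5, 7):
        print(K, count_consistent_representations(K), 2 ** (K - 1))
```

Since the patterns are treated independently, the total over $P$ patterns is the $P$-th power of this factor.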

To obtain the training set average one uses a double replication. Instead of $\Theta(V_\iota(\mathbb{D}))$
one considers $V_\iota(\mathbb{D})^m$ for integer $m$, taking the limit $m \to 0$ in the end; the second
replication is used to calculate the logarithm in the usual way. We thus consider

$$S(m) = \lim_{N\to\infty} \frac{1}{KN} \frac{\partial}{\partial \tilde n} \Big\langle \Big( \mathrm{Tr}_\iota\, V_\iota(\mathbb{D})^{m} \Big)^{\tilde n} \Big\rangle_{\mathbb{D}} \Big|_{\tilde n = 0} \tag{4.24}$$

and are mainly interested in $S(0) = \lim_{m\to 0} S(m)$. As long as $S(0)$ is positive,
realizable internal representations exist, and the storage capacity of the committee is
not exhausted. For $P = \alpha KN$, the smallest value $\alpha_d(K)$ for which $S(0) = 0$ marks
the transition to a regime where the number of internal representations is no longer
exponential in $N$. So $\alpha_d(K)$ is a lower bound on the capacity and for finite $K$ one
will not expect the bound to be tight; for instance $\alpha_d(1) = 0$ but the critical capacity
for $K = 1$ is 2. It is, however, reasonable to expect that $\alpha_d(K)$ for large $K$ yields an
asymptotically tight bound since the volume in weight space associated with any single
internal representation should vanish in this limit.



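To calculate $S(m)$ we use Eq. (4.23) and obtain in a first step

$$\Big( \mathrm{Tr}_\iota\, V_\iota(\mathbb{D})^{m} \Big)^{\tilde n} = \int dJ \prod_{\mu, u} \mathrm{Tr}_\iota\, \Theta\Big( \tau^\mu \sum_{k=1}^{K} \iota_k \Big) \prod_{k, v} \Theta\big( \iota_k J_k^{uvT} \xi^\mu \big) , \tag{4.25}$$

where the replica index $v$ runs from 1 to $m$, and for the other replica index:
$u = 1, \ldots, \tilde n$. The symbol $dJ$ refers to $K \tilde n m$ integrations over unit length weight vectors
$J_k^{uv}$ in $\mathbb{R}^N$. We now have to perform the training set average which, after commuting
with the weight integral, can be rewritten in terms of zero mean Gaussian random
variables $X_k^{uv}(Q)$ with covariances

$$\big\langle X_k^{uv}(Q)\, X_{k'}^{u'v'}(Q) \big\rangle = Q_{kk'}^{uv,u'v'} = J_k^{uvT} J_{k'}^{u'v'} .$$

Transforming to an integral over the order parameter matrix then yields

$$\lim_{N\to\infty} \frac{1}{KN} \ln \Big\langle \Big( \mathrm{Tr}_\iota\, V_\iota(\mathbb{D})^{m} \Big)^{\tilde n} \Big\rangle_{\mathbb{D}} = \max_{Q}\; \alpha \ln \Big\langle \prod_{u} \mathrm{Tr}_\iota\, \Theta\Big( \sum_{k=1}^{K} \iota_k \Big) \prod_{k, v} \Theta\big( \iota_k X_k^{uv}(Q) \big) \Big\rangle_{X(Q)} + \frac{1}{2K} \ln\det Q .$$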

To make further progress we need to parameterize $Q$. Referring back to Eq. (4.25), we
see that the weight vectors $J_k^{uv}$ and $J_k^{u'v'}$ belong to committee machines which use the
same internal representation to store the training set if $u = u'$. So, even when aiming for
a replica symmetric parameterization, it makes sense to assume that $Q_{kk'}^{uv,u'v'}$ depends
on $\delta_{uu'}$. Further, to control the number of order parameters, we need to assume site
symmetry, that is $Q$ depends on $k$ and $k'$ only via $\delta_{kk'}$. These considerations motivate
parameterizing the full order parameter matrix $\mathcal{Q}$ as $\mathcal{Q} = M_K(Q + P/K,\, P/K)$ where

$$Q = M_{\tilde n}\big( M_m(1 - p_2/K,\, q_1),\, M_m(q_0, q_0) \big) , \quad P = M_{\tilde n}\big( M_m(p_2, p_1),\, M_m(p_0, p_0) \big) .$$

Now, comparing to Eq. (4.16), we see that this is just the RSB1 parameterization
used in the Gardner volume calculation for the CM if we equate $\tilde n = n/m$. So we
have already calculated $\det \mathcal{Q}$ and the same decomposition of $X(Q)$ into independent
contributions as in the preceding section can be used. We then obtain the following
remarkable analogy to the calculation of the Gardner volume:

$$S(m) = \operatorname*{extr}_{\{q_i\},\{p_i\}}\; G_s(\{q_i\},\{p_i\},m) + \alpha\, G_r(\{q_i\},\{p_i\},m) ,$$

where $G_s$ and $G_r$ are exactly the same as in the preceding section. The critical capacity
$\alpha_d(K)$ is given by the condition $S(0) = 0$. In the RSB1 calculation the bound $\hat\alpha_c(K)$
was obtained from the divergence to $-\infty$ of $(G_s + \alpha \hat G_r)/m$ when this expression was
also minimized w.r.t. $m$. But minimizing in $m$ yielded that $m \to 0$ as $\alpha$ approaches
$\hat\alpha_c(K)$ and this is just the limit needed when counting internal representations. So we
obtain the simple result that

$$\alpha_d(K) = \hat\alpha_c(K) \sim \frac{16}{\pi - 2} \sqrt{\ln K} .$$
In addition we now have a very nice interpretation of the order parameters, e.g. the
overlaps $q_1, p_2$ refer to networks which use the same internal representation to store
the training patterns, whereas networks with differing internal representations yield
the overlaps $q_0, p_1$.

However, all is not well. By definition $\alpha_d(K)$ should be a lower bound to the
critical capacity, but $\alpha_c(K)$ is an upper bound. This shows that the above (doubly)
replica symmetric parameterization is too simple-minded, and replica symmetry is
broken, presumably for networks which use different internal representations. However,
having argued that both $\alpha_d(K)$ and $\alpha_c(K)$ are tight bounds in the limit of large $K$,
one can reasonably assume that this complication does not invalidate the asymptotic
findings. This is supported by results in jl^] where the stability of the replica symmetric 
stationary point was analyzed for the TCM when counting internal representations. 
The replica symmetric solution was found unstable for finite values of K but marginally 
stable in the large K limit. 



Appendix A 

The entropy term 



We want to calculate a volume of the form

\[ D_n(Q) = \int dJ\, \delta(Q - J^T J) = \int dJ \prod_{a,b=1\,(a \le b)}^{n} \delta(Q^{ab} - J^{aT} J^{b}) \tag{A.1} \]

where $Q$ is a symmetric, positive definite $(n, n)$-matrix of overlaps and $J$ is the $(N, n)$-
matrix which is composed of the $n$ vectors $J^a \in \mathbb{R}^N$.
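
For $n = 1$ the volume in (A.1) can be checked directly, since the constraint then
confines $J$ to a sphere of squared radius $Q$. The sketch below smears the
$\delta$-function over a thin shell $Q \le |J|^2 \le Q + \epsilon$, whose volume is known in
closed form, and compares the volume at two values of $Q$; the ratio should approach
$Q^{(N-2)/2}$. The function name, the shell width `eps` and the chosen values of $N$
and $Q$ are illustrative assumptions.

```python
import math


def shell_volume_density(Q, N, eps=1e-6):
    """Volume of {J in R^N : Q <= |J|^2 <= Q + eps}, divided by eps.

    For small eps this approximates the n = 1 integral over delta(Q - J.J).
    """
    unit_ball = math.pi ** (N / 2) / math.gamma(N / 2 + 1)
    # A ball of squared radius s has volume unit_ball * s**(N/2).
    return unit_ball * ((Q + eps) ** (N / 2) - Q ** (N / 2)) / eps


N = 10   # illustrative dimension
Q = 2.0  # illustrative squared length
ratio = shell_volume_density(Q, N) / shell_volume_density(1.0, N)
predicted = Q ** ((N - 2) / 2)
```

Up to corrections of order `eps`, `ratio` and `predicted` agree, which is the power
law one expects for a sphere in $N$ dimensions.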

For a suitable orthogonal $(n, n)$-matrix $o$ and a diagonal $(n, n)$-matrix $D$ one can
write $Q$ as $Q = o^T D D o$. We now apply the linear transformation $J \to J D o$ to the above
integral. Its determinant is $\det D^N$ and we obtain

\[ D_n(Q) = \int dJ\, \delta\big(o^T D (1 - J^T J) D o\big)\, \det D^N . \tag{A.2} \]

The Fourier representation of the $\delta$-function yields

\[ \delta\big(o^T D (1 - J^T J) D o\big) = C_n \int d\hat Q\, \exp\big( i\, \mathrm{Tr}\, \hat Q\, o^T D (1 - J^T J) D o \big) . \tag{A.3} \]

The integration runs over symmetric $(n, n)$-matrices $\hat Q$ and $C_n = (2\pi)^{-n(n+1)/2}\, 2^{n(n-1)/2}$,
where the second factor arises from the fact that the off-diagonal elements are counted
twice in the trace. Using

\[ \mathrm{Tr}\big[ \hat Q\, o^T D (1 - J^T J) D o \big] = \mathrm{Tr}\big[ D o \hat Q o^T D\, (1 - J^T J) \big] \]

and transforming $\hat Q$ via $\hat Q \to o^T D^{-1} \hat Q D^{-1} o$ yields

\[ \begin{aligned}
\delta\big(o^T D (1 - J^T J) D o\big) &= C_n \det D^{-n-1} \int d\hat Q\, \exp\big( i\, \mathrm{Tr}\, \hat Q (1 - J^T J) \big) \\
&= \det D^{-n-1}\, \delta(1 - J^T J)
\end{aligned} \tag{A.4} \]
and thus $D_n(Q) = \det D^{N-n-1} D_n(1)$. Of course $\det D^2 = \det Q$, so finally

\[ D_n(Q) = D_n(1)\, (\det Q)^{(N-n-1)/2} \tag{A.5} \]

where $D_n(1)$ is just a normalization constant.
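
The exponent $-n-1$ in (A.4) comes from the change of the integration variable, which
is a map of the form $S \to A^T S A$ on symmetric $(n, n)$-matrices with $A = D^{-1} o$. Such a
map has Jacobian $|\det A|^{n+1}$. A minimal sketch for $n = 2$, where the three independent
elements $(S_{11}, S_{12}, S_{22})$ can be handled explicitly, is given below. The helper names,
the rotation angle and the entries of $D$ are illustrative assumptions.

```python
import math


def matmul(X, Y):
    """Plain matrix product of two lists of lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def transpose(X):
    return [list(row) for row in zip(*X)]


def det3(M):
    """Determinant of a 3 x 3 matrix by cofactor expansion."""
    return (M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
            - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
            + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0]))


def congruence_jacobian(A):
    """Jacobian of S -> A^T S A in the coordinates (S11, S12, S22)."""
    basis = [[[1.0, 0.0], [0.0, 0.0]],   # unit change of S11
             [[0.0, 1.0], [1.0, 0.0]],   # unit change of S12 = S21
             [[0.0, 0.0], [0.0, 1.0]]]   # unit change of S22
    images = []
    for E in basis:
        T = matmul(matmul(transpose(A), E), A)
        images.append([T[0][0], T[0][1], T[1][1]])
    # Rows are the images of the basis; the determinant equals that of the transpose.
    return det3(images)


theta = 0.7                                   # illustrative rotation angle
o = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta), math.cos(theta)]]      # orthogonal, det o = 1
d1, d2 = 2.0, 3.0                             # illustrative diagonal of D
D_inv = [[1.0 / d1, 0.0], [0.0, 1.0 / d2]]

A = matmul(D_inv, o)                          # so that A^T S A = o^T D^-1 S D^-1 o
jacobian = congruence_jacobian(A)
expected = (d1 * d2) ** (-3)                  # det D^(-n-1) with n = 2
```

Together with the factor $\det D^N$ from (A.2), this reproduces the power $N - n - 1$ of
$\det D$ that enters (A.5).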

The case where one considers an additional $(N, m)$-matrix $B$ of $m$ teacher vectors
and wants to evaluate $\int dJ\, \delta(Q - J^T J)\, \delta(R - J^T B)$ reduces to the above consideration
by noting that the integral will not depend on the choice of $B$, as long as the matrix of
teacher overlaps $T = B^T B$ is held fixed. Thus, one may in addition integrate over all
$B$ which have correlation matrix $T$.